# <div style = "padding: 20px; border-radius: 5px; background-color: #7C4DFF;color: white"><center>PRE-PROCESSING STAGE-II - DATASET A and B</center></DIV>

## SCOPE

- Processing Datasets A and B.
- The second stage of pre-processing loads the consolidated data from the previous stage, examines it, performs preliminary cleaning and formatting, and stores the result in an accessible format.
- Dataset A contains data for all the cities.
- Dataset B contains data for all the cities, with the majority of records from CHENNAI, HYDERABAD, BANGALORE, DELHI and MUMBAI.

## IMPORTING LIBRARIES

In [1]:
import pandas as pd # for processing dataset A
import numpy as np
import polars as pl  # for processing dataset B

## <div style = "padding: 20px; border-radius: 40px; background-color: #40E0D0"> Dataset A - (using PANDAS) </div>

### `Overview` DATA

In [2]:
df_overview_A = pd.read_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-I\overview_A.csv',low_memory=False)

In [3]:
df_overview_A.shape

(112559, 61)

In [4]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 112559 entries, 0 to 112558
Data columns (total 61 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   carId                      112559 non-null  int64  
 1   profileId                  112559 non-null  object 
 2   vehicle                    112559 non-null  object 
 3   city                       112559 non-null  object 
 4   state                      112559 non-null  object 
 5   price                      112559 non-null  object 
 6   color                      112559 non-null  object 
 7   kilometers                 112559 non-null  object 
 8   fuelName                   112559 non-null  object 
 9   transmissionType           112559 non-null  object 
 10  sellerId                   112559 non-null  int64  
 11  insurance                  112233 non-null  object 
 12  insuranceExpiry            31379 non-null   object 
 13  interiorColor              26

#### REMOVING DUPLICATES

In [5]:
# cloning the dataframe from the previous process for contingency
# can be restored from previous process

df_overview_A_1 = df_overview_A.copy()

* `profileId` is the unique id for each car record
* as shown below, many `profileId` values are repeated, some up to 40 times

In [6]:
df_overview_A.value_counts(['profileId'])

profileId
D4112795     40
D4111563     40
D4111541     40
D4117045     39
D4111629     37
             ..
S2673651      1
S2673653      1
S2673655      1
S2673671      1
S2819375      1
Name: count, Length: 59708, dtype: int64

* deleting duplicates with respect to profileId

In [7]:
df_overview_A.drop_duplicates(['profileId'],inplace=True)

In [8]:
df_overview_A.shape

(59708, 61)
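The de-duplication above relies on the default behaviour of `drop_duplicates`, which keeps the first occurrence of each `profileId`. A minimal sketch on toy data (the frame `toy` and its values are made up for illustration) shows one way to count the repeated ids beforehand and confirm uniqueness afterwards:

```python
import pandas as pd

# Hypothetical toy frame; the ids and prices are illustrative only.
toy = pd.DataFrame({
    "profileId": ["D1", "D1", "S2", "D1", "S3"],
    "price": ["5 Lakh", "5 Lakh", "2 Lakh", "5 Lakh", "7 Lakh"],
})

# Rows that repeat an earlier profileId (the first occurrence is not flagged).
n_repeats = int(toy.duplicated(subset="profileId").sum())

# keep="first" is the default; it is stated explicitly here for clarity.
deduped = toy.drop_duplicates(subset="profileId", keep="first")

print(n_repeats, len(deduped), deduped["profileId"].is_unique)
```

If the repeated rows can differ in other columns, which row is kept depends on the current sort order, so sorting before de-duplicating may be worth considering.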

In [9]:
df_overview_A.value_counts(['profileId'])

profileId
D1820959     1
S2727795     1
S2727753     1
S2727757     1
S2727759     1
            ..
D4162739     1
D4162741     1
D4162751     1
D4162753     1
S2819375     1
Name: count, Length: 59708, dtype: int64

In [10]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59708 entries, 0 to 112558
Data columns (total 61 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   carId                      59708 non-null  int64  
 1   profileId                  59708 non-null  object 
 2   vehicle                    59708 non-null  object 
 3   city                       59708 non-null  object 
 4   state                      59708 non-null  object 
 5   price                      59708 non-null  object 
 6   color                      59708 non-null  object 
 7   kilometers                 59708 non-null  object 
 8   fuelName                   59708 non-null  object 
 9   transmissionType           59708 non-null  object 
 10  sellerId                   59708 non-null  int64  
 11  insurance                  59422 non-null  object 
 12  insuranceExpiry            20574 non-null  object 
 13  interiorColor              219 non-null    object 

#### COLUMN-WISE UNDERSTANDING OF DATA AND CLEANING

In [11]:
# taking a clone after removing duplicates
df_overview_A_2 = df_overview_A.copy()

In [12]:
# checking the kind of data in each column
df_overview_A.head().iloc[:,0:20]

Unnamed: 0,carId,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,sellerId,insurance,insuranceExpiry,interiorColor,lifeTimeTax,insuranceLink,makeName,modelName,versionName,mainImageUrl
0,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar,75 Lakh,Polar White,6000,Diesel + Diesel,Automatic,2,ThirdParty,01 Oct 2023,,Not Available,/insurance/?car=6681&reg=2022-04-01&utm=musedc...,Mercedes-Benz,GLE,300d 4MATIC LWB,https://imgd.aeplcdn.com/640X480/cw/ucp/s27255...
1,2,S2730115,Maruti Suzuki Vitara Brezza VDi (O) [2016-2018],A&N Islands,Andaman Nicobar,6 Lakh,White,55000,Diesel + Diesel,Manual,2,Comprehensive,01 Aug 2024,,Not Available,/insurance/?car=4635&reg=2016-08-01&utm=musedc...,Maruti Suzuki,Vitara Brezza [2016-2020],VDi (O) [2016-2018],https://imgd.aeplcdn.com/640X480/cw/ucp/s27301...
2,3,S2673115,Maruti Suzuki Vitara Brezza VDi,A&N Islands,Andaman Nicobar,6.5 Lakh,Premium Silver,58000,Diesel + Diesel,Manual,2,Expired,,,Not Available,/insurance/?car=4631&reg=2018-05-01&utm=musedc...,Maruti Suzuki,Vitara Brezza [2016-2020],VDi,
3,4,S2719839,Ford Figo Duratorq Diesel EXI 1.4,A&N Islands,Andaman Nicobar,2.2 Lakh,Diamond White,68000,Diesel + Diesel,Manual,2,Not Available,,,Not Available,/insurance/?car=1736&reg=2011-09-01&utm=musedc...,Ford,Figo [2010-2012],Duratorq Diesel EXI 1.4,https://imgd.aeplcdn.com/640X480/cw/ucp/s27198...
6,7,S2694537,Maruti Suzuki Ritz GENUS VXI,A&N Islands,Andaman Nicobar,2.2 Lakh,Glistening Grey,75000,Petrol,Manual,2,Not Available,,,Not Available,/insurance/?car=1937&reg=2012-03-01&utm=musedc...,Maruti Suzuki,Ritz [2009-2012],GENUS VXI,


In [13]:
# checking the kind of data in each column
df_overview_A.head().iloc[:,20:40]

Unnamed: 0,photoCount,makeYear,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,lastUpdatedDate,totalPhotosUploaded,similarCarsUrl,stockRecommendationUrl,cwBasePackageId,ctePackageId,modelId,versionId,registrationNumber,tcStockId,regType
0,13,2022,Apr,anislands,A&N Islands,196,First,Not Available,A&N Islands,1 month(s) ago,13,/api/stocks/S2725575/similarcars,/api/stocks/S2725575/similarcars,0,0,1204,6681,Not Available,0.0,Individual
1,5,2016,Aug,anislands,A&N Islands,196,First,Not Available,A&N Islands,1 month(s) ago,5,/api/stocks/S2730115/similarcars,/api/stocks/S2730115/similarcars,0,0,976,4635,Not Available,0.0,Individual
2,0,2018,May,anislands,A&N Islands,196,First,Not Available,A&N Islands,2 month(s) ago,0,/api/stocks/S2673115/similarcars,/api/stocks/S2673115/similarcars,0,0,976,4631,UP32JY8632,0.0,Corporate
3,5,2011,Sep,anislands,A&N Islands,196,First,Not Available,A&N Islands,1 month(s) ago,5,/api/stocks/S2719839/similarcars,/api/stocks/S2719839/similarcars,0,0,667,1736,Not Available,0.0,
6,0,2012,Mar,anislands,A&N Islands,196,Second,Not Available,"Aberdeen Bazar, A&N Islands",2 month(s) ago,0,/api/stocks/S2694537/similarcars,/api/stocks/S2694537/similarcars,0,0,725,1937,An01E8363,0.0,Individual


In [14]:
# checking the kind of data in each column
df_overview_A.head().iloc[:,40:60]

Unnamed: 0,priceNumeric,isCertified,videoUrl,modelMaskingName,rootName,bodyStyleId,isSellCarOfferAvailable,allowBooking,emiPrice,shouldShowEmiSlug,isHomeTestDrive,formattedRegistrationDate,fuelEconomy,dealershipLogoUrl,isSold,oemVehicleUrl,virtualPhoneNumber,homeTestDriveSlug,formattedOriginalPrice,stockId
0,7500000,False,,gle,GLE,6,False,False,1.25 L,False,False,Not Available,Not Available,,False,,,,,
1,600000,False,,vitara-brezza-2016-2020,Vitara Brezza,6,False,False,9964,False,False,Not Available,Not Available,,False,,,,,
2,650000,False,,vitara-brezza-2016-2020,Vitara Brezza,6,False,False,10794,True,False,Not Available,Not Available,,False,,,,,
3,220000,False,,figo-2010-2012,Figo,3,False,False,3653,False,False,Not Available,Not Available,,False,,,,,
6,220000,False,,ritz-2009-2012,Ritz,3,False,False,3653,False,False,Not Available,Not Available,,False,,,,,


In [15]:
# checking the kind of data in each column
df_overview_A.head().iloc[:,60:70]

Unnamed: 0,additionalFuelName
0,
1,
2,
3,
6,


* after a preliminary assessment, the columns below do not appear useful for the analysis
* **"carId","sellerId","insuranceExpiry","interiorColor","lifeTimeTax","insuranceLink","mainImageUrl",
                                                "photoCount","lastUpdatedDate","totalPhotosUploaded",
                                                "similarCarsUrl","stockRecommendationUrl","cwBasePackageId","ctePackageId","modelId","versionId",
                                                "registrationNumber","tcStockId","isCertified","videoUrl","modelMaskingName","bodyStyleId",
                                                "isSellCarOfferAvailable","allowBooking","emiPrice","shouldShowEmiSlug","isHomeTestDrive",
                                                "formattedRegistrationDate","fuelEconomy","dealershipLogoUrl","isSold","oemVehicleUrl",
                                                "virtualPhoneNumber","homeTestDriveSlug","formattedOriginalPrice","stockId","additionalFuelName"**

In [16]:
df_overview_A = df_overview_A.drop(columns=["carId","sellerId","insuranceExpiry","interiorColor","lifeTimeTax","insuranceLink","mainImageUrl",
                                                "photoCount","lastUpdatedDate","totalPhotosUploaded",
                                                "similarCarsUrl","stockRecommendationUrl","cwBasePackageId","ctePackageId","modelId","versionId",
                                                "registrationNumber","tcStockId","isCertified","videoUrl","modelMaskingName","bodyStyleId",
                                                "isSellCarOfferAvailable","allowBooking","emiPrice","shouldShowEmiSlug","isHomeTestDrive",
                                                "formattedRegistrationDate","fuelEconomy","dealershipLogoUrl","isSold","oemVehicleUrl",
                                                "virtualPhoneNumber","homeTestDriveSlug","formattedOriginalPrice","stockId","additionalFuelName"]
                                  )

#"cityMaskingName","cityId","carAvailbaleAt"
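The column selection above was made by inspection. As a rough aid for that kind of review, the sketch below tabulates the share of missing values and the number of distinct values per column; it is not the procedure used in this notebook, and the toy frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy frame; column names echo the dataset, values are made up.
toy = pd.DataFrame({
    "interiorColor": [None, None, None, "Black"],
    "isSold": [False, False, False, False],
    "price": ["5 Lakh", "6 Lakh", "7 Lakh", "8 Lakh"],
})

# Fraction of missing values and count of distinct non-null values per column.
report = pd.DataFrame({
    "null_share": toy.isna().mean(),
    "n_unique": toy.nunique(dropna=True),
})

print(report)
```

Columns that are mostly missing or hold a single constant value are natural candidates to review, although the final decision still depends on the analysis goals.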

In [17]:
# after deleting columns
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59708 entries, 0 to 112558
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         59708 non-null  object
 1   vehicle           59708 non-null  object
 2   city              59708 non-null  object
 3   state             59708 non-null  object
 4   price             59708 non-null  object
 5   color             59708 non-null  object
 6   kilometers        59708 non-null  object
 7   fuelName          59708 non-null  object
 8   transmissionType  59708 non-null  object
 9   insurance         59422 non-null  object
 10  makeName          59708 non-null  object
 11  modelName         59708 non-null  object
 12  versionName       59708 non-null  object
 13  makeYear          59708 non-null  int64 
 14  makeMonth         59708 non-null  object
 15  cityMaskingName   59694 non-null  object
 16  cityName          59708 non-null  object
 17  cityId          

In [18]:
df_overview_A.shape

(59708, 24)

- finding the count of unique items for each column

In [19]:
# checking the count of unique values to analyse the values in each column
df_overview_A.nunique()

profileId           59708
vehicle              5424
city                 1030
state                  35
price                2332
color                1193
kilometers          14254
fuelName               38
transmissionType       62
insurance               6
makeName               46
modelName             942
versionName          4624
makeYear               33
makeMonth              12
cityMaskingName      1034
cityName             1030
cityId               1033
noOfOwners              6
registerCity          377
carAvailbaleAt       8402
regType                 3
priceNumeric         4268
rootName              404
dtype: int64

* checking each column for anomalies

##### `profileId`

In [20]:

# first clearing out the leading and trailing spaces

df_overview_A['profileId'] = df_overview_A['profileId'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['profileId'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['profileId'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['profileId']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['profileId']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['profileId']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['profileId']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['profileId']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['profileId']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 
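The same block of checks is repeated for every column in this section. A small helper can hold the logic in one place; the sketch below is one possible shape, where the function name `summarize_missing` and the `PLACEHOLDERS` tuple are hypothetical and not part of the original notebook. To my understanding, an element-wise `== None` comparison in pandas is False for every element, so `isna()` is used to cover that case.

```python
import pandas as pd

# Placeholder strings checked in the cells above (hypothetical constant name).
PLACEHOLDERS = ("", " ", "Not Available", "0")


def summarize_missing(series: pd.Series) -> dict:
    """Count nulls and common placeholder values in one column."""
    counts = {"null": int(series.isna().sum())}
    for token in PLACEHOLDERS:
        counts[repr(token)] = int((series == token).sum())
    # Numeric zero is checked separately from the string "0".
    counts["0 (numeric)"] = int((series == 0).sum())
    return counts


# Toy column with one null and three placeholder values.
demo = pd.Series(["S1", None, "Not Available", "", "0"])
print(summarize_missing(demo))
```

Calling `summarize_missing(df_overview_A['profileId'])` would then stand in for the eight-line `print` statement.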



In [21]:
# checking profileId for any irregularities in the values
# the data seems to be consistent
# profileId starts with either S or D, so the data here is consistent

set(pd.Series(df_overview_A['profileId'].unique()).str.slice(stop = 2))

{'D1', 'D2', 'D3', 'D4', 'S2'}
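Looking at the first two characters confirms the prefixes but not the rest of the id. A stricter check could test the whole value against a pattern. The sketch below assumes that a valid id is a single `S` or `D` followed only by digits, which matches the samples shown above but is not a documented rule.

```python
import pandas as pd

# Hypothetical sample; the last two values are deliberately malformed.
ids = pd.Series(["S2725575", "D4112795", "X123", "S27A"])

# Assumed pattern: one letter, S or D, followed only by digits.
is_valid = ids.str.fullmatch(r"[SD]\d+")

# Ids that do not fit the assumed pattern.
print(ids[~is_valid].tolist())
```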

##### `vehicle`

In [22]:
# first clearing out the leading and trailing spaces

df_overview_A['vehicle'] = df_overview_A['vehicle'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['vehicle'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['vehicle'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['vehicle']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['vehicle']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['vehicle']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['vehicle']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['vehicle']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['vehicle']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [23]:
# data is consistent here as well
# will not be specifically used for analysis, but is kept to retain the original description as shown in the website

df_overview_A['vehicle'].str.partition(sep=' ')[0].unique()

array(['Mercedes-Benz', 'Maruti', 'Ford', 'Volkswagen', 'Nissan',
       'Hyundai', 'Renault', 'Honda', 'Toyota', 'Skoda', 'Tata',
       'Mahindra', 'MG', 'Audi', 'BMW', 'Land', 'Jaguar', 'Volvo',
       'Porsche', 'Kia', 'Jeep', 'MINI', 'Chevrolet', 'Force',
       'Mitsubishi', 'Fiat', 'Datsun', 'Isuzu', 'Ssangyong', 'Maserati',
       'Lexus', 'Mahindra-Renault', 'Bentley', 'Aston', 'Citroen',
       'Cadillac', 'Chrysler', 'ICML', 'Rolls-Royce', 'Hindustan',
       'Hummer', 'Lamborghini', 'Ferrari', 'Opel', 'Ashok', 'Sipani'],
      dtype=object)

In [24]:
# removing leading/trailing spaces if any
df_overview_A['vehicle'] = df_overview_A['vehicle'].str.strip()

##### `cityName`, `city`, `state`, `cityMaskingName`,`cityId` and `carAvailbaleAt`

In [25]:
# first clearing out the leading and trailing spaces

df_overview_A['city'] = df_overview_A['city'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['city'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['city'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['city']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['city']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['city']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['city']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['city']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['city']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [26]:
# first clearing out the leading and trailing spaces

df_overview_A['cityName'] = df_overview_A['cityName'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['cityName'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['cityName'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['cityName']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['cityName']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['cityName']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['cityName']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['cityName']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['cityName']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  2 
 None -  0 
 0 -  0 
 0(str) -  0 



In [27]:
# checking if city and cityName are the same

print(df_overview_A[df_overview_A['cityName'] == df_overview_A['city']].shape,
      df_overview_A[df_overview_A['cityName'] != df_overview_A['city']].shape)

(44470, 24) (15238, 24)


* 15238 records have a `city` value that differs from `cityName`
* `cityName` is taken as the correct city, since it was embedded in the JSON extracted from the source
* there are two 'Not Available' values in `cityName`, which could be filled from `cityMaskingName` or `carAvailbaleAt`, as all three fields hold similar values
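Not every mismatch between `city` and `cityName` is the same kind of problem: some are spelling variants of one place, others are different cities altogether. Counting the distinct pairs makes that easier to see. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows; the pairs are illustrative only.
toy = pd.DataFrame({
    "city": ["Adoni", "Adoni", "Dharamsala", "Pune"],
    "cityName": ["Hyderabad", "Hyderabad", "Dharamshala", "Pune"],
})

mismatch = toy[toy["city"] != toy["cityName"]]

# Count each distinct (city, cityName) pair among the mismatches.
pairs = (
    mismatch.groupby(["city", "cityName"])
    .size()
    .sort_values(ascending=False)
)

print(pairs)
```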

In [28]:
df_overview_A[df_overview_A['cityName'] == 'Not Available']

Unnamed: 0,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,insurance,...,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,regType,priceNumeric,rootName
46797,S2751753,Chevrolet Spark LT 1.0,Gulbarga,Karnataka,1.4 Lakh,Sanddrift Grey,78000,Petrol,Manual - 5 Gears,Third Party,...,Dec,Not Available,Not Available,52,Third,Not Available,Not Available,Individual,140000,Spark
46820,S2789201,Kia Sonet GTX Plus 1.5,Gulbarga,Karnataka,15 Lakh,Blue,6000,Diesel,Manual - 6 Gears,,...,Aug,Not Available,Not Available,52,First,Not Available,Not Available,,1500000,Sonet


* since `cityMaskingName` and `carAvailbaleAt` are also 'Not Available' for these two records, the `city` column value (Gulbarga) is used instead

In [29]:
df_overview_A['cityName'] = df_overview_A['cityName'].replace('Not Available','Gulbarga')
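Hard-coding 'Gulbarga' works here because both affected records share the same `city`. A more general variant, sketched below on toy data, takes the replacement from the `city` column row by row, so it would still behave sensibly if the placeholder appeared for several different cities.

```python
import pandas as pd

# Hypothetical toy frame with one placeholder value in cityName.
toy = pd.DataFrame({
    "city": ["Gulbarga", "Pune"],
    "cityName": ["Not Available", "Pune"],
})

# Where cityName is the placeholder, take the value from city instead.
is_placeholder = toy["cityName"] == "Not Available"
toy["cityName"] = toy["cityName"].mask(is_placeholder, toy["city"])

print(toy["cityName"].tolist())
```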

In [30]:
print(df_overview_A[df_overview_A['cityName'] == df_overview_A['city']].shape,
      df_overview_A[df_overview_A['cityName'] != df_overview_A['city']].shape)

(44472, 24) (15236, 24)


In [31]:
df_overview_A[df_overview_A['cityName'] == 'Not Available']

Unnamed: 0,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,insurance,...,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,regType,priceNumeric,rootName


* the 'Not Available' values were replaced with Gulbarga
* next, looking into `state` and how records with incorrect or unavailable cities can be updated with the correct state names

* create a new column and first fill in the state names for records whose cities match
* the missing state names for the remaining cities can then be obtained from ChatGPT and updated

In [32]:
list(df_overview_A['state'].unique())

['Andaman Nicobar',
 'Punjab',
 'Rajasthan',
 'Telangana',
 'Kerala',
 'Andhra Pradesh',
 'Madhya Pradesh',
 'Tripura',
 'Uttar Pradesh',
 'Gujarat',
 'Maharashtra',
 'Mizoram',
 'West Bengal',
 'Uttarakhand',
 'Haryana',
 'Tamil Nadu',
 'Chhattisgarh',
 'Jammu & Kashmir',
 'Karnataka',
 'Orissa',
 'Bihar',
 'Himachal Pradesh',
 'Jharkhand',
 'Arunachal Pradesh',
 'Goa',
 'Assam',
 'Sikkim',
 'Meghalaya',
 'Nagaland',
 'Daman & Diu',
 'Manipur',
 'Pondicherry',
 'Chandigarh',
 'Dadra and Nagar Haveli',
 'Delhi']

In [33]:
# first clearing out the leading and trailing spaces

df_overview_A['state'] = df_overview_A['state'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['state'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['state'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['state']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['state']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['state']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['state']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['state']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['state']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [34]:
# comparing the states with the actual/available states in India
# Correction needed for "Andaman Nicobar"
# Ladakh and Lakshadweep are not available

states_india = ["Andhra Pradesh", "Arunachal Pradesh", "Assam", "Bihar", "Chhattisgarh", 
                "Goa", "Gujarat", "Haryana", "Himachal Pradesh", "Jharkhand", "Karnataka", 
                "Kerala", "Madhya Pradesh", "Maharashtra", "Manipur", "Meghalaya", "Mizoram", 
                "Nagaland", "Orissa", "Punjab", "Rajasthan", "Sikkim", "Tamil Nadu", "Telangana", 
                "Tripura", "Uttar Pradesh", "Uttarakhand", "West Bengal","Andaman and Nicobar Islands", "Chandigarh", 
                "Dadra and Nagar Haveli", "Daman & Diu", "Lakshadweep", "Delhi", 
                "Pondicherry", "Jammu & Kashmir", "Ladakh"]

set(states_india).difference(set(df_overview_A['state'].unique()))

{'Andaman and Nicobar Islands', 'Ladakh', 'Lakshadweep'}
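The difference above lists reference names that are absent from the data, but a misspelt name in the data (such as 'Andaman Nicobar') only shows up when the difference is taken in the other direction. A small sketch with shortened, illustrative sets:

```python
# Shortened, illustrative sets; the full lists are in the cells above.
reference = {"Andaman and Nicobar Islands", "Delhi", "Ladakh"}
observed = {"Andaman Nicobar", "Delhi"}

# Reference names that never occur in the data.
missing_from_data = reference - observed

# Names in the data that are not in the reference list (likely misspellings).
unknown_in_data = observed - reference

print(sorted(missing_from_data))
print(sorted(unknown_in_data))
```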

In [35]:
# Changing the state name from "Andaman Nicobar" to "Andaman and Nicobar Islands"

df_overview_A['state'] = df_overview_A['state'].replace('Andaman Nicobar','Andaman and Nicobar Islands')
df_overview_A['state'].unique()

array(['Andaman and Nicobar Islands', 'Punjab', 'Rajasthan', 'Telangana',
       'Kerala', 'Andhra Pradesh', 'Madhya Pradesh', 'Tripura',
       'Uttar Pradesh', 'Gujarat', 'Maharashtra', 'Mizoram',
       'West Bengal', 'Uttarakhand', 'Haryana', 'Tamil Nadu',
       'Chhattisgarh', 'Jammu & Kashmir', 'Karnataka', 'Orissa', 'Bihar',
       'Himachal Pradesh', 'Jharkhand', 'Arunachal Pradesh', 'Goa',
       'Assam', 'Sikkim', 'Meghalaya', 'Nagaland', 'Daman & Diu',
       'Manipur', 'Pondicherry', 'Chandigarh', 'Dadra and Nagar Haveli',
       'Delhi'], dtype=object)

In [36]:
#creating an empty column
df_overview_A['state_2'] = None

In [37]:
df_overview_A.head(1)

Unnamed: 0,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,insurance,...,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,regType,priceNumeric,rootName,state_2
0,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman and Nicobar Islands,75 Lakh,Polar White,6000,Diesel + Diesel,Automatic,ThirdParty,...,anislands,A&N Islands,196,First,Not Available,A&N Islands,Individual,7500000,GLE,


In [38]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59708 entries, 0 to 112558
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         59708 non-null  object
 1   vehicle           59708 non-null  object
 2   city              59708 non-null  object
 3   state             59708 non-null  object
 4   price             59708 non-null  object
 5   color             59708 non-null  object
 6   kilometers        59708 non-null  object
 7   fuelName          59708 non-null  object
 8   transmissionType  59708 non-null  object
 9   insurance         59422 non-null  object
 10  makeName          59708 non-null  object
 11  modelName         59708 non-null  object
 12  versionName       59708 non-null  object
 13  makeYear          59708 non-null  int64 
 14  makeMonth         59708 non-null  object
 15  cityMaskingName   59694 non-null  object
 16  cityName          59708 non-null  object
 17  cityId          

In [39]:
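# copying the existing state into state_2 where city matches cityName
def update_states(row):
    # label-based access; positions 2, 16, 3 and 24 are city, cityName, state and state_2
    if row['city'] == row['cityName']:
        row['state_2'] = row['state']
    else:
        row['state_2'] = None
    return row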

In [40]:
df_overview_A = df_overview_A.apply(update_states, axis = 1)
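Row-wise `apply` calls the Python function once per row, which can be slow on a frame of this size. The same rule can be written as a single column operation; the sketch below uses `Series.where` on toy data and is an alternative formulation, not the code that was run in this notebook.

```python
import pandas as pd

# Hypothetical toy frame; the first row has mismatched city fields.
toy = pd.DataFrame({
    "city": ["Adoni", "Pune"],
    "cityName": ["Hyderabad", "Pune"],
    "state": ["Andhra Pradesh", "Maharashtra"],
})

# Keep state where city matches cityName; leave it missing elsewhere.
toy["state_2"] = toy["state"].where(toy["city"] == toy["cityName"])

print(toy["state_2"].isna().tolist())
```

The non-matching rows end up as NaN rather than `None`; `isnull()` treats both as missing, so later checks of that kind behave the same way.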

In [41]:
# checking the cities for which states were not updated or available
df_overview_A[df_overview_A['state_2'].isnull()].cityName.unique()

array(['Hyderabad', 'Lucknow', 'Kanpur', 'Delhi', 'Pune', 'Mumbai',
       'Jaipur', 'Nagpur', 'Navi Mumbai', 'Agra', 'Faizabad', 'Meerut',
       'Haldwani', 'Mohali', 'Chandigarh', 'Ludhiana', 'Jalandhar',
       'Kharar', 'Zirakpur', 'Ahmedabad', 'Vadodara', 'Surat', 'Patna',
       'Muzaffurpur', 'Purnea', 'Kolkata', 'Ara', 'Nashik', 'Thane',
       'Bangalore', 'Gurgaon', 'Mangalore', 'Dehradun', 'Udupi',
       'Dak. Kannada', 'Kuchaman', 'Ranchi', 'Bhagalpur', 'Chennai',
       'Coimbatore', 'Raipur', 'Siliguri', 'Karnal', 'Goa', 'Nagaon',
       'Kalyan', 'Badlapur', 'Kollam', 'Nellore', 'Kota', 'Bhubaneswar',
       'Indore', 'Ghaziabad', 'Noida', 'Rajkot', 'Tezpur', 'Rudrapur',
       'Faridabad', 'Jamshedpur', 'Kharagpur', 'Malappuram',
       'Kurukshetra', 'Gorakhpur', 'Aurangabad', 'Kolhapur', 'Bhopal',
       'Alwar', 'Dharamshala'], dtype=object)

In [42]:
list(df_overview_A[df_overview_A['state_2'].isnull()].cityName.unique())

['Hyderabad',
 'Lucknow',
 'Kanpur',
 'Delhi',
 'Pune',
 'Mumbai',
 'Jaipur',
 'Nagpur',
 'Navi Mumbai',
 'Agra',
 'Faizabad',
 'Meerut',
 'Haldwani',
 'Mohali',
 'Chandigarh',
 'Ludhiana',
 'Jalandhar',
 'Kharar',
 'Zirakpur',
 'Ahmedabad',
 'Vadodara',
 'Surat',
 'Patna',
 'Muzaffurpur',
 'Purnea',
 'Kolkata',
 'Ara',
 'Nashik',
 'Thane',
 'Bangalore',
 'Gurgaon',
 'Mangalore',
 'Dehradun',
 'Udupi',
 'Dak. Kannada',
 'Kuchaman',
 'Ranchi',
 'Bhagalpur',
 'Chennai',
 'Coimbatore',
 'Raipur',
 'Siliguri',
 'Karnal',
 'Goa',
 'Nagaon',
 'Kalyan',
 'Badlapur',
 'Kollam',
 'Nellore',
 'Kota',
 'Bhubaneswar',
 'Indore',
 'Ghaziabad',
 'Noida',
 'Rajkot',
 'Tezpur',
 'Rudrapur',
 'Faridabad',
 'Jamshedpur',
 'Kharagpur',
 'Malappuram',
 'Kurukshetra',
 'Gorakhpur',
 'Aurangabad',
 'Kolhapur',
 'Bhopal',
 'Alwar',
 'Dharamshala']

In [43]:
# loading the state data for the above cities (obtained from ChatGPT)

city_state_missing = pd.read_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\01.Data Extraction\city_state_missing.csv')

In [44]:
city_state_missing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68 entries, 0 to 67
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   cityName    68 non-null     object
 1   state_name  68 non-null     object
dtypes: object(2)
memory usage: 1.2+ KB


In [45]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 59708 entries, 0 to 112558
Data columns (total 25 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         59708 non-null  object
 1   vehicle           59708 non-null  object
 2   city              59708 non-null  object
 3   state             59708 non-null  object
 4   price             59708 non-null  object
 5   color             59708 non-null  object
 6   kilometers        59708 non-null  object
 7   fuelName          59708 non-null  object
 8   transmissionType  59708 non-null  object
 9   insurance         59422 non-null  object
 10  makeName          59708 non-null  object
 11  modelName         59708 non-null  object
 12  versionName       59708 non-null  object
 13  makeYear          59708 non-null  int64 
 14  makeMonth         59708 non-null  object
 15  cityMaskingName   59694 non-null  object
 16  cityName          59708 non-null  object
 17  cityId          

In [46]:
# merging the missing details with the main dataset for update

df_overview_A = df_overview_A.merge(city_state_missing, on ='cityName',how='left')
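A left merge silently multiplies rows if a `cityName` appears more than once in the lookup table. The `validate` and `indicator` parameters of `merge` can guard against that and show which rows found a match. A minimal sketch with made-up frames (`main` and `lookup` are hypothetical names):

```python
import pandas as pd

# Hypothetical frames standing in for the main dataset and the lookup file.
main = pd.DataFrame({"cityName": ["Hyderabad", "Pune", "Hyderabad"]})
lookup = pd.DataFrame({
    "cityName": ["Hyderabad"],
    "state_name": ["Telangana"],
})

# validate="m:1" raises a MergeError if cityName repeats in the lookup;
# indicator=True adds a _merge column showing which rows found a match.
merged = main.merge(
    lookup, on="cityName", how="left", validate="m:1", indicator=True
)

# The row count should be unchanged by a many-to-one left merge.
print(len(merged) == len(main))
print(merged[["cityName", "state_name", "_merge"]])
```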

In [47]:
df_overview_A.head(2)

Unnamed: 0,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,insurance,...,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,regType,priceNumeric,rootName,state_2,state_name
0,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman and Nicobar Islands,75 Lakh,Polar White,6000,Diesel + Diesel,Automatic,ThirdParty,...,A&N Islands,196,First,Not Available,A&N Islands,Individual,7500000,GLE,Andaman and Nicobar Islands,
1,S2730115,Maruti Suzuki Vitara Brezza VDi (O) [2016-2018],A&N Islands,Andaman and Nicobar Islands,6 Lakh,White,55000,Diesel + Diesel,Manual,Comprehensive,...,A&N Islands,196,First,Not Available,A&N Islands,Individual,600000,Vitara Brezza,Andaman and Nicobar Islands,


In [48]:
df_overview_A[df_overview_A['state_2'].isnull()]

Unnamed: 0,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,insurance,...,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,regType,priceNumeric,rootName,state_2,state_name
90,D4053815,Mercedes-Benz C-Class C 220d,Adoni,Andhra Pradesh,60 Lakh,Silver,4000,Diesel,Automatic (TC),Not Available,...,Hyderabad,105,UnRegistered Car,Not Available,"Madhapur, Hyderabad",,6000000,C-Class,,Telangana
91,D4050997,Mercedes-Benz C-Class C 300d,Adoni,Andhra Pradesh,65 Lakh,White,3500,Diesel,Automatic,Not Available,...,Hyderabad,105,UnRegistered Car,Not Available,"Madhapur, Hyderabad",,6500000,C-Class,,Telangana
92,D4109997,Mercedes-Benz EQB 300 4MATIC,Adoni,Andhra Pradesh,74.5 Lakh,White,1500,Electric,Automatic,Not Available,...,Hyderabad,105,UnRegistered Car,Not Available,"Madhapur, Hyderabad",,7450000,EQB,,Telangana
93,D4050999,Mercedes-Benz C-Class C 220d,Adoni,Andhra Pradesh,69 Lakh,White,3500,Diesel,Automatic,Not Available,...,Hyderabad,105,UnRegistered Car,Not Available,"Madhapur, Hyderabad",,6900000,C-Class,,Telangana
94,D4051003,Mercedes-Benz E-Class E 200 Exclusive [2019-2019],Adoni,Andhra Pradesh,72 Lakh,Black,2500,Petrol,Automatic,Not Available,...,Hyderabad,105,UnRegistered Car,Not Available,"Madhapur, Hyderabad",,7200000,E-Class,,Telangana
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59699,S2787339,Toyota Innova 2.5 E 7 STR,Dharamsala,Himachal Pradesh,4.5 Lakh,White,335000,Diesel,Manual,Not Available,...,Dharamshala,599,Second,Not Available,Dharamshala,Individual,450000,Innova,,Himachal Pradesh
59700,S2703773,Hyundai Eon Magna +,Dharamsala,Himachal Pradesh,2.7 Lakh,Sleek Silver,90000,Petrol,Manual,Comprehensive,...,Dharamshala,599,First,Not Available,Dharamshala,Individual,270000,Eon,,Himachal Pradesh
59701,S2712695,Ford EcoSport Titanium + 1.5L TDCi,Dharamsala,Himachal Pradesh,6.5 Lakh,Moondust Silver,60000,Diesel,Manual,Not Available,...,Dharamshala,599,First,Not Available,Dharamshala,,650000,Ecosport,,Himachal Pradesh
59702,S2717599,Mahindra XUV500 W10 AWD,Dharamsala,Himachal Pradesh,8.3 Lakh,Moondust Silver,135000,Diesel,Manual,Not Available,...,Dharamshala,599,First,Not Available,Dharamshala,,830192,XUV500,,Himachal Pradesh


In [49]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 26 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         59708 non-null  object
 1   vehicle           59708 non-null  object
 2   city              59708 non-null  object
 3   state             59708 non-null  object
 4   price             59708 non-null  object
 5   color             59708 non-null  object
 6   kilometers        59708 non-null  object
 7   fuelName          59708 non-null  object
 8   transmissionType  59708 non-null  object
 9   insurance         59422 non-null  object
 10  makeName          59708 non-null  object
 11  modelName         59708 non-null  object
 12  versionName       59708 non-null  object
 13  makeYear          59708 non-null  int64 
 14  makeMonth         59708 non-null  object
 15  cityMaskingName   59694 non-null  object
 16  cityName          59708 non-null  object
 17  cityId      

In [50]:
# creating a custom function to update the missing state value
def update_missing_state(row):
    # fill state_2 from state_name when state_2 is missing (None or NaN)
    if pd.isna(row['state_2']):
        row['state_2'] = row['state_name']
    return row

In [51]:
# updating all the missing states

df_overview_A = df_overview_A.apply(update_missing_state,axis = 1)
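The row-wise `apply` above can be slow on larger frames. As a minimal sketch, the same fill can be done column-wise with `fillna`. The toy frame below reuses the notebook's column names, but its values are made up for illustration.

```python
import pandas as pd

# Toy frame; column names mirror the notebook, values are illustrative only.
demo = pd.DataFrame({
    "state_2": ["Telangana", None, float("nan")],
    "state_name": ["Telangana", "Himachal Pradesh", "Goa"],
})

# fillna aligns on the index and only touches missing entries (None or NaN).
demo["state_2"] = demo["state_2"].fillna(demo["state_name"])
print(demo["state_2"].tolist())
```

On the real data this would presumably read `df_overview_A['state_2'] = df_overview_A['state_2'].fillna(df_overview_A['state_name'])`.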

In [52]:
# checking if all the states have been updated
df_overview_A[df_overview_A['state_2'].isnull()]

Unnamed: 0,profileId,vehicle,city,state,price,color,kilometers,fuelName,transmissionType,insurance,...,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,regType,priceNumeric,rootName,state_2,state_name


In [53]:
# checking if city and state contain any null or placeholder values

print( "null - ", df_overview_A[df_overview_A['state_2'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['state_2'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['state_2']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['state_2']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['state_2']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['state_2']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['state_2']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['state_2']== '0'].shape[0],'\n'
     )

print( "null - ", df_overview_A[df_overview_A['cityName'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['cityName'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['cityName']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['cityName']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['cityName']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['cityName']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['cityName']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['cityName']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 
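The same eight-line check is repeated for most columns in this notebook. A small helper could reduce the repetition. This is only a sketch, and `placeholder_counts` is a hypothetical name, not part of the original notebook. It collapses the "null", "nan" and "None" rows into one, because `isnull` is an alias of `isna` and an element-wise `== None` comparison never matches in pandas.

```python
import pandas as pd

def placeholder_counts(series: pd.Series) -> dict:
    """Count common 'missing value' markers in one column (hypothetical helper)."""
    return {
        "missing": int(series.isna().sum()),  # covers both None and NaN
        "empty": int((series == "").sum()),
        "space": int((series == " ").sum()),
        "Not Available": int((series == "Not Available").sum()),
        "0": int((series == 0).sum()),
        "0(str)": int((series == "0").sum()),
    }

# Made-up sample containing one of each marker.
sample = pd.Series(["Goa", None, "", " ", "Not Available", "0", 0], dtype="object")
counts = placeholder_counts(sample)
print(counts)
```

It could then be called as, for example, `placeholder_counts(df_overview_A['cityName'])`.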



* no null or empty values in `state_2` or `cityName`
* deleting the columns that are no longer required from the dataframe
    * city, state, cityMaskingName, cityId, state_name

In [54]:
df_overview_A.drop(columns = ['city', 'state', 'cityMaskingName' ,'cityId', 'state_name'],inplace = True)

* renaming the columns cityName and state_2

In [55]:
df_overview_A.rename(columns = {'cityName':'city','state_2':'state'},inplace = True)

In [56]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         59708 non-null  object
 1   vehicle           59708 non-null  object
 2   price             59708 non-null  object
 3   color             59708 non-null  object
 4   kilometers        59708 non-null  object
 5   fuelName          59708 non-null  object
 6   transmissionType  59708 non-null  object
 7   insurance         59422 non-null  object
 8   makeName          59708 non-null  object
 9   modelName         59708 non-null  object
 10  versionName       59708 non-null  object
 11  makeYear          59708 non-null  int64 
 12  makeMonth         59708 non-null  object
 13  city              59708 non-null  object
 14  noOfOwners        59708 non-null  object
 15  registerCity      59708 non-null  object
 16  carAvailbaleAt    59708 non-null  object
 17  regType     

##### `color`

In [57]:

df_overview_A['color'] = df_overview_A['color'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['color'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['color'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['color']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['color']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['color']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['color']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['color']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['color']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [58]:
# checking color column
# the color values are not consistent

array_clr = df_overview_A['color'].unique()
array_clr.sort()
list(array_clr)

['1149000',
 'A3dmndwhit',
 'ALP Blue',
 'Absolute Black',
 'Abyss Black',
 'Acquamarine',
 'Agate Red',
 'Akbaraliisonlin',
 'Alabaster Silver',
 'Alabaster Silver Metallic',
 'Alabastor Silver Metallic',
 'Alp Blue',
 'Alpha Blue',
 'Alpine Blue',
 'Alpine White',
 'Amalfi White',
 'Amazon Blue',
 'Amazon Green',
 'Amber Orange',
 'Amethyst Royal',
 'Anthracite Grey',
 'Anyway',
 'Aodmndwhit',
 'Apine White Non Metallic',
 'Apple Red',
 'Aqua Marine',
 'Aqua Marine, Pearl White',
 'Aqua Mist',
 'Aqua Rush',
 'Aqua Teal',
 'Aqua Tint',
 'Aquamarine',
 'Aquateal',
 'Aquatel',
 'Arcade Grey',
 'Arcitc White',
 'Arctic Blue',
 'Arctic Silver',
 'Arctic White',
 'Arctic White + Black',
 'Arizona Blue',
 'Artic Silver',
 'Artic White',
 'Ash',
 'Asmani',
 'Astern Black',
 'Atlantic Blue',
 'Atlantis Bule',
 'Atlas Black',
 'Atlas White',
 'Atlas White With Abyss Black Roof',
 'Atlas White With Black Roof',
 'Atomic Orange With Black Roof',
 'Attitude Black',
 'Attitude Black Mica',
 'Aubur

* checking if basic colors could be separated from the lot

In [59]:
# first clearing out the leading and trailing spaces
df_overview_A['color'] = df_overview_A['color'].str.strip()

In [60]:
# splitting the colors and expanding them into separate columns

df_overview_A[['color_1','color_2','color_3','color_4','color_5','color_6','color_7','color_8','color_9']] = df_overview_A['color'].str.split(' ',expand=True)

In [61]:
df_overview_A[['color_1','color_2','color_3','color_4','color_5','color_6','color_7','color_8','color_9']]

Unnamed: 0,color_1,color_2,color_3,color_4,color_5,color_6,color_7,color_8,color_9
0,Polar,White,,,,,,,
1,White,,,,,,,,
2,Premium,Silver,,,,,,,
3,Diamond,White,,,,,,,
4,Glistening,Grey,,,,,,,
...,...,...,...,...,...,...,...,...,...
59703,Sleek,Silver,,,,,,,
59704,New,Pearl,White,,,,,,
59705,Caribbean,Blue,Metallic,,,,,,
59706,Diamond,White,,,,,,,


In [62]:
# clearing out leading and trailing spaces

color_cols = [f'color_{i}' for i in range(1, 10)]
for col in color_cols:
    df_overview_A[col] = df_overview_A[col].str.strip()

In [63]:
df_overview_A['color_1'].unique()

array(['Polar', 'White', 'Premium', 'Diamond', 'Glistening', 'Deep',
       'Blade', 'Solid', 'Sleek', 'Pearl', 'Granite', 'Royal', 'Red',
       'Golden', 'Pure', 'Superior', 'Silky', 'Chill', 'Symphony',
       'Arctic', 'Coral', 'Champagne', 'Purple', 'Brilliant', 'Superio',
       'Blue', 'Metallic', 'Tafeta', 'Dusky', 'Magma', 'Napoli', 'Modern',
       'Uata', 'Black', 'Moondust', 'Porcelain', 'Panther', 'Flash',
       'Espresso', 'Venetian', 'Bright', 'Nexa', 'Speedy', 'Phantom',
       'Radiant', 'Marine', 'Silver', 'Bronze', 'Grey', 'Fiery', 'Breeze',
       'Wine', 'Colorada', 'Ivory', 'Block', 'Meteor', 'Calgary', 'Mint',
       'Aqua', 'Midnight', 'Lunar', 'Grandeur', 'Other', 'Blazing',
       'Titan', 'Others', 'Brown', 'Gold', 'Green', 'Ice', 'Orchid',
       'Crystal', 'Pearlescent', 'Super', 'Urban', 'Oyster', 'Dazzling',
       'Bakers', 'Candy', 'Sea', 'Switchblade', 'Canyon', 'Stone',
       'Enerqetic', 'Gray', 'Opal', 'Typhoon', 'Twilight', 'Orcus',
       'Torna

* the color values are still inconsistent after splitting and are not suitable for analysis
* deleting the color column along with the temporary split columns
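If a coarse color were needed later, one possible approach is to search each name for a known base-color word instead of splitting on spaces. This is a rough sketch, not what the notebook does. The `BASE_COLORS` list and the `base_color` function are assumptions made for illustration, and names without a recognised word (or with misspellings) would simply return `None`.

```python
import re

# Assumed list of base colors; extend as needed.
BASE_COLORS = ["white", "black", "silver", "grey", "gray", "red", "blue",
               "green", "brown", "gold", "orange", "yellow", "beige", "purple"]
_PATTERN = re.compile(r"\b(" + "|".join(BASE_COLORS) + r")\b", re.IGNORECASE)

def base_color(name):
    """Return the first base-color word found in a color name, else None."""
    match = _PATTERN.search(name)
    if match is None:
        return None
    word = match.group(1).lower()
    return "grey" if word == "gray" else word  # unify the two spellings

for raw in ["Polar White", "Atlas White With Abyss Black Roof", "Akbaraliisonlin"]:
    print(raw, "->", base_color(raw))
```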

In [64]:
df_overview_A.drop(columns = ['color_1','color_2','color_3','color_4','color_5','color_6','color_7','color_8','color_9'],inplace = True)

In [65]:
df_overview_A.drop(columns = ['color'],inplace = True)

In [66]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         59708 non-null  object
 1   vehicle           59708 non-null  object
 2   price             59708 non-null  object
 3   kilometers        59708 non-null  object
 4   fuelName          59708 non-null  object
 5   transmissionType  59708 non-null  object
 6   insurance         59422 non-null  object
 7   makeName          59708 non-null  object
 8   modelName         59708 non-null  object
 9   versionName       59708 non-null  object
 10  makeYear          59708 non-null  int64 
 11  makeMonth         59708 non-null  object
 12  city              59708 non-null  object
 13  noOfOwners        59708 non-null  object
 14  registerCity      59708 non-null  object
 15  carAvailbaleAt    59708 non-null  object
 16  regType           25222 non-null  object
 17  priceNumeric

##### `kilometers`

In [67]:
# first clearing out the leading and trailing spaces

df_overview_A['kilometers'] = df_overview_A['kilometers'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['kilometers'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['kilometers'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['kilometers']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['kilometers']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['kilometers']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['kilometers']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['kilometers']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['kilometers']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  12 



* the kilometers data is stored as strings, and 12 records have the value '0'.
* converting kilometers to float

In [68]:
# removing comma ','
df_overview_A['kilometers'] = df_overview_A['kilometers'].str.replace(',', '', regex=False)

In [69]:
df_overview_A['kilometers'] = df_overview_A['kilometers'].astype('float64',copy=False)
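`astype('float64')` raises an error if any value is not a valid number. A more defensive variant, sketched here on made-up values, uses `pd.to_numeric` with `errors="coerce"` so that unparseable entries become `NaN` and can be counted afterwards.

```python
import pandas as pd

# Made-up sample: thousands separator, stray spaces, a zero and a bad value.
raw = pd.Series(["6,000", " 55000 ", "0", "n/a"])

cleaned = raw.str.strip().str.replace(",", "", regex=False)
km = pd.to_numeric(cleaned, errors="coerce")  # invalid strings become NaN

print(km.tolist())
print("unparseable:", int(km.isna().sum()))
```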

In [70]:
df_overview_A['kilometers'].head()

0     6000.0
1    55000.0
2    58000.0
3    68000.0
4    75000.0
Name: kilometers, dtype: float64

In [71]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         59708 non-null  object 
 1   vehicle           59708 non-null  object 
 2   price             59708 non-null  object 
 3   kilometers        59708 non-null  float64
 4   fuelName          59708 non-null  object 
 5   transmissionType  59708 non-null  object 
 6   insurance         59422 non-null  object 
 7   makeName          59708 non-null  object 
 8   modelName         59708 non-null  object 
 9   versionName       59708 non-null  object 
 10  makeYear          59708 non-null  int64  
 11  makeMonth         59708 non-null  object 
 12  city              59708 non-null  object 
 13  noOfOwners        59708 non-null  object 
 14  registerCity      59708 non-null  object 
 15  carAvailbaleAt    59708 non-null  object 
 16  regType           25222 non-null  object

* some of the kilometer values are 0
* these records will be filtered or handled during the analysis

In [72]:
df_overview_A[df_overview_A['kilometers']==0].shape

(12, 20)

##### `fuelName`

In [73]:
# first clearing out the leading and trailing spaces

df_overview_A['fuelName'] = df_overview_A['fuelName'].str.strip()

# checking inconsistencies and null values


print( "null - ", df_overview_A[df_overview_A['fuelName'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['fuelName'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['fuelName']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['fuelName']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['fuelName']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['fuelName']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['fuelName']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['fuelName']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  6 
 None -  0 
 0 -  0 
 0(str) -  0 



* there are 6 'Not Available' values
* fuel names are also not consistent
* this will be analysed along with the specifications column, because that column also contains the fuel type.

In [74]:
df_overview_A['fuelName'].unique()

array(['Diesel + Diesel', 'Petrol', 'Petrol + LPG', 'Diesel', 'CNG + CNG',
       'Petrol + Petrol', 'Diesel + Petrol', 'Petrol + CNG', 'LPG + LPG',
       'Petrol + Diesel', 'Diesel + LPG', 'Diesel + CNG', 'Electric',
       'CNG', 'LPG', 'Hybrid', 'CNG + Petrol+Cng', 'LPG + Petrol+Lpg',
       'CNG + LPG', 'Mild Hybrid(Electric + Petrol)', 'CNG + Petrol',
       'Electric + Electric', 'LPG + CNG', 'Electric + CNG',
       'Electric + LPG', 'Hybrid + Hybrid(Ele',
       'Mild Hybrid(Electric + Petrol) + Mild Hybri',
       'Mild Hybrid(Electric + Petrol) + CNG',
       'Mild Hybrid(Electric + Petrol) + Hybrid(Ele',
       'Petrol + Petrol+Cng', 'Petrol + Electric', 'Not Available',
       'Petrol + Hybrid(Ele', 'Hybrid (Electric + Petrol)',
       'Hybrid + Petrol', 'Mild Hybrid (Electric + Diesel)',
       'Plug-in Hybrid (Electric + Petrol) + CNG',
       'Mild Hybrid(Electric + Petrol) + LPG'], dtype=object)
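Many of the values listed above look like a primary fuel followed by `+` and a secondary fuel, sometimes with a truncated parenthetical. As a hedged sketch of one way to reduce them to a primary fuel, the hypothetical helper `primary_fuel` below drops any parenthetical detail (closed or not) and keeps the text before the first `+`. Whether the first part really is the primary fuel is an assumption.

```python
import re

def primary_fuel(value):
    """Return the text before the first '+', ignoring parenthetical detail."""
    # Remove "(...)" groups, including ones cut off before the closing bracket.
    without_parens = re.sub(r"\([^)]*\)?", "", value)
    return without_parens.split("+")[0].strip()

for raw in ["Diesel + Diesel",
            "Mild Hybrid(Electric + Petrol) + CNG",
            "Hybrid + Hybrid(Ele",
            "CNG + Petrol+Cng"]:
    print(raw, "->", primary_fuel(raw))
```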

##### `transmissionType`

In [75]:
# first clearing out the leading and trailing spaces

df_overview_A['transmissionType'] = df_overview_A['transmissionType'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['transmissionType'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['transmissionType'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['transmissionType']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['transmissionType']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['transmissionType']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['transmissionType']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['transmissionType']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['transmissionType']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  6 
 None -  0 
 0 -  0 
 0(str) -  0 



In [76]:
df_overview_A[df_overview_A['transmissionType'] == 'Not Available'].shape

(6, 20)

* transmission type data is also not consistent; it will be analysed along with specifications

In [77]:
df_overview_A['transmissionType'].unique()

array(['Automatic', 'Manual', 'Automatic (AMT)', 'Automatic (CVT)',
       'Automatic (TC)', 'Automatic (DCT)', 'Clutchless Manual (IMT)',
       'Manual - 5 Gears', 'Manual - 6 Gears', 'Automatic (e-CVT)',
       'Automatic - 8 Gears, Paddle Shift, Sport Mode',
       'Automatic - 8 Gears, Manual Override',
       'Automatic (TC) - 8 Gears, Manual Override, Sport Mode',
       'Automatic (TC) - 8 Gears, Manual Override & Paddle Shift, Sport Mode',
       'Automatic - 9 Gears, Paddle Shift, Sport Mode',
       'Automatic (CVT) - CVT Gears, Sport Mode',
       'Automatic (DCT) - 7 Gears, Manual Override, Sport Mode',
       'Automatic - 8 Gears, Manual Override & Paddle Shift, Sport Mode',
       'Automatic - 7 Gears, Paddle Shift, Sport Mode',
       'Automatic (TC) - 9 Gears, Paddle Shift, Sport Mode',
       'Automatic - 6 Gears, Manual Override, Sport Mode',
       'Automatic - 8 Gears, Sport Mode',
       'Automatic - 6 Gears, Sport Mode', 'Not Available',
       'Automatic - 6 Gea

* grouping the transmission types into 2 major categories
    * Automatic - includes all types - AMT, CVT, IMT, TC, DCT
    * Manual

In [78]:
df_overview_A['transmission'] = df_overview_A['transmissionType'].str.split(' ', n=1,expand=True)[0]

In [79]:
# grouping into 2 major categories
df_overview_A['transmission'] = df_overview_A['transmission'].str.replace(',','',regex = True).replace('Not',None,regex=True).replace('Clutchless','Automatic',regex=True).replace('AMT','Automatic',regex=True)
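The grouping rule in the chained `replace` calls above can also be written as a small function, which may be easier to read and test. This is a sketch of the same idea, not the notebook's code, and `group_transmission` is a hypothetical name. Like the original, it treats 'Clutchless' and 'AMT' as Automatic and maps 'Not Available' to a missing value.

```python
def group_transmission(value):
    """Map a raw transmissionType string to 'Automatic', 'Manual' or None."""
    if value is None or value == "Not Available":
        return None
    head = value.split(" ", 1)[0].replace(",", "")  # first word, commas removed
    if head in {"Automatic", "Clutchless", "AMT"}:
        return "Automatic"
    return head  # 'Manual', or any unexpected first word left as-is

for raw in ["Automatic (TC) - 8 Gears, Manual Override, Sport Mode",
            "Clutchless Manual (IMT)", "Manual - 5 Gears", "Not Available"]:
    print(raw, "->", group_transmission(raw))
```

It could be applied with `df_overview_A['transmissionType'].map(group_transmission)`.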

In [80]:
# removing transmission type
df_overview_A.drop(columns = ['transmissionType'],inplace = True)

In [81]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 20 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profileId       59708 non-null  object 
 1   vehicle         59708 non-null  object 
 2   price           59708 non-null  object 
 3   kilometers      59708 non-null  float64
 4   fuelName        59708 non-null  object 
 5   insurance       59422 non-null  object 
 6   makeName        59708 non-null  object 
 7   modelName       59708 non-null  object 
 8   versionName     59708 non-null  object 
 9   makeYear        59708 non-null  int64  
 10  makeMonth       59708 non-null  object 
 11  city            59708 non-null  object 
 12  noOfOwners      59708 non-null  object 
 13  registerCity    59708 non-null  object 
 14  carAvailbaleAt  59708 non-null  object 
 15  regType         25222 non-null  object 
 16  priceNumeric    59708 non-null  int64  
 17  rootName        59708 non-null 

##### `makeName`

In [82]:
# first clearing out the leading and trailing spaces

df_overview_A['makeName'] = df_overview_A['makeName'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['makeName'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['makeName'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['makeName']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['makeName']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['makeName']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['makeName']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['makeName']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['makeName']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [83]:
df_overview_A['makeName'].unique()

array(['Mercedes-Benz', 'Maruti Suzuki', 'Ford', 'Volkswagen', 'Nissan',
       'Hyundai', 'Renault', 'Honda', 'Toyota', 'Skoda', 'Tata',
       'Mahindra', 'MG', 'Audi', 'BMW', 'Land Rover', 'Jaguar', 'Volvo',
       'Porsche', 'Kia', 'Jeep', 'MINI', 'Chevrolet', 'Force Motors',
       'Mitsubishi', 'Fiat', 'Datsun', 'Isuzu', 'Ssangyong', 'Maserati',
       'Lexus', 'Mahindra-Renault', 'Bentley', 'Aston Martin', 'Citroen',
       'Cadillac', 'Chrysler', 'ICML', 'Rolls-Royce', 'Hindustan Motors',
       'Hummer', 'Lamborghini', 'Ferrari', 'Opel', 'Ashok Leyland',
       'Sipani'], dtype=object)

* Make names are consistent; no changes or cleanups needed

##### `modelName`, `rootName` and `versionName`

In [84]:

# first clearing out the leading and trailing spaces in modelName

df_overview_A['modelName'] = df_overview_A['modelName'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['modelName'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['modelName'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['modelName']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['modelName']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['modelName']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['modelName']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['modelName']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['modelName']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* Model names are inconsistent: some carry year-range suffixes

In [85]:
df_overview_A['modelName'].unique()

array(['GLE', 'Vitara Brezza [2016-2020]', 'Figo [2010-2012]',
       'Ritz [2009-2012]', 'Alto 800', 'Vento [2012-2014]', 'Magnite',
       'i20 [2008-2010]', 'Triber [2019-2023]', 'Swift [2014-2018]',
       'Zen [1996-2003]', 'Alto K10', 'City [2011-2014]',
       'Swift DZire [2011-2015]', 'City ZX', 'Kwid', 'Amaze [2018-2021]',
       'Creta [2018-2019]', 'Grand i10 [2013-2017]',
       'Alto K10 [2010-2014]', 'Estilo [2006-2009]',
       'Wagon R [2006-2010]', 'Figo [2012-2015]',
       'Etios Liva [2011-2013]', 'Baleno', 'Santro Xing [2008-2015]',
       'Verna [2006-2010]', 'Verna [2011-2015]',
       'Swift Dzire [2015-2017]', 'Fabia', 'Ertiga [2012-2015]',
       'Dzire [2017-2020]', 'Safari [2015-2017]', 'Eeco',
       'Alto 800 [2012-2016]', 'Wagon R 1.0 [2010-2013]',
       'S-Presso [2019-2022]', 'Verito [2011-2012]',
       'Hector Plus [2020-2023]', 'S-Cross [2017-2020]', 'Estilo',
       'Wagon R 1.0 [2014-2019]', 'Scorpio', 'Bolero [2020-2022]',
       'City 4th Gener

* column 'rootName' contains the model names without the year suffixes

In [86]:
df_overview_A['rootName'].unique()

array(['GLE', 'Vitara Brezza', 'Figo', 'Ritz', 'Alto 800', 'Vento',
       'Magnite', 'i20', 'Triber', 'Swift', 'Zen', 'Alto K10', 'City',
       'Swift DZire', 'Kwid', 'Amaze', 'Creta', 'Grand i10', 'Alto',
       'Estilo', 'Wagon R', 'Etios Liva', 'Baleno', 'Santro', 'Verna',
       'Fabia', 'Ertiga', 'DZire', 'Safari', 'Eeco', 'S-Presso',
       'Logan/Verito', 'Hector Plus', 'S-Cross', 'Scorpio', 'Bolero',
       'Grand i10 NIOS', 'Armada', 'Accord', '800', 'Fiesta/Classic',
       'Aura', 'Vista', 'Ecosport', 'Polo', 'Tiago', 'Zest', 'Corolla',
       'Freestyle', 'Kiger', 'C-Class', 'EQB', 'E-Class', 'GLA',
       'A-Class Limousine', 'GLB', 'A4', 'AMG A35 Limousine', 'Indigo',
       'Xylo', 'Exter', 'Venture', 'XUV500', 'Nexon', 'XUV300', 'Indica',
       'Getz', 'Fortuner', 'Omni', 'Urban Cruiser Hyryder', 'Thar',
       'Celerio X', 'Celerio', 'X5', 'X1', 'WR-V', 'TUV300',
       'Discovery Sport', 'XE', 'X3', '7-Series', 'XC60', 'GL-Class',
       '6-Series GT', '3-Series', 

In [87]:
# first clearing out the leading and trailing spaces from versionName

df_overview_A['versionName'] = df_overview_A['versionName'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['versionName'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['versionName'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['versionName']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['versionName']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['versionName']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['versionName']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['versionName']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['versionName']== '0'].shape[0],'\n'
     )



null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [88]:
list(df_overview_A['versionName'].unique())

['300d 4MATIC LWB',
 'VDi (O) [2016-2018]',
 'VDi',
 'Duratorq Diesel EXI 1.4',
 'GENUS VXI',
 'LXi (O)',
 'Comfortline Diesel',
 'XE  [2020]',
 'LXi (O) CNG',
 'Asta 1.2',
 'RXL',
 'VXi [2014-2017]',
 'LXi',
 'VXi Plus AGS',
 '1.5 V AT',
 'VXI',
 'VTEC',
 'CLIMBER (O) 1.0 AMT Dual Tone',
 '1.2 E MT Petrol [2018-2020]',
 'E Plus 1.4 CRDi',
 'Sportz 1.1 CRDi [2013-2016]',
 'ZDi Plus',
 'LX',
 'LXi Minor',
 'Duratorq Diesel Titanium 1.4',
 'GD',
 'Zeta AGS',
 'GENUS VDI',
 'GL',
 'CRDI VGT SX 1.5',
 'Fluidic 1.6 CRDi SX Opt AT',
 'VDi ABS',
 'Active 1.2 MPI',
 'GL LPG',
 'ZDi',
 'Sports Edition 1.2L Kappa VTVT',
 '4x2 EXi BS-III',
 '5 STR AC [2022-2023]',
 'Lxi',
 'VXi',
 '1.5 D4 BS-IV',
 'Sharp Hybrid 1.5 Petrol',
 'GXi',
 'Alpha 1.3',
 'VXi ABS BS-IV',
 'VDI',
 'S11 MT 7S CC',
 'B6',
 'SV Petrol',
 'VDI SHVS',
 'Asta 1.2 Kappa VTVT [2023]',
 'VXi Plus AGS [2022-2023]',
 'AC',
 'S11',
 'N10',
 '1.8 MT',
 'AC BS-II',
 '5 STR AC CNG',
 'EXi 1.4',
 'SX 1.2 (O) Petrol',
 'LS TDI BS-III',
 '

* modelName carries year ranges that might not be required for analysis, hence deleting that column
* renaming 'rootName' to 'modelName'
* keeping version name intact
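Where a `rootName` column is not available, the year suffix could be stripped from `modelName` with a regular expression. The sketch below assumes suffixes of the form `[YYYY-YYYY]` or `[YYYY]` at the end of the string, and `strip_year_suffix` is a hypothetical helper. It is not a drop-in replacement for `rootName`, which may group models differently (for example it lists 'Logan/Verito').

```python
import re

# Matches a trailing "[2016-2020]" or "[2020]" plus surrounding spaces.
YEAR_SUFFIX = re.compile(r"\s*\[\d{4}(?:-\d{4})?\]\s*$")

def strip_year_suffix(name):
    """Remove a trailing bracketed year or year range from a model name."""
    return YEAR_SUFFIX.sub("", name).strip()

for raw in ["Vitara Brezza [2016-2020]", "GLE", "Wagon R 1.0 [2010-2013]"]:
    print(raw, "->", strip_year_suffix(raw))
```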

In [89]:
df_overview_A[['modelName','rootName']].head(2)

Unnamed: 0,modelName,rootName
0,GLE,GLE
1,Vitara Brezza [2016-2020],Vitara Brezza


In [90]:
df_overview_A.drop(columns = ['modelName'],inplace = True)

In [91]:
df_overview_A.rename({'rootName':'modelName'},axis = 1,inplace=True)

In [92]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 19 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profileId       59708 non-null  object 
 1   vehicle         59708 non-null  object 
 2   price           59708 non-null  object 
 3   kilometers      59708 non-null  float64
 4   fuelName        59708 non-null  object 
 5   insurance       59422 non-null  object 
 6   makeName        59708 non-null  object 
 7   versionName     59708 non-null  object 
 8   makeYear        59708 non-null  int64  
 9   makeMonth       59708 non-null  object 
 10  city            59708 non-null  object 
 11  noOfOwners      59708 non-null  object 
 12  registerCity    59708 non-null  object 
 13  carAvailbaleAt  59708 non-null  object 
 14  regType         25222 non-null  object 
 15  priceNumeric    59708 non-null  int64  
 16  modelName       59708 non-null  object 
 17  state           59708 non-null 

##### `makeYear` and `makeMonth`

In [93]:
df_overview_A['makeYear'].unique()

array([2022, 2016, 2018, 2011, 2012, 2023, 2021, 2009, 2020, 2014, 2000,
       2013, 2008, 2019, 2010, 2007, 2017, 2015, 2003, 2001, 2006, 2004,
       2002, 2005, 1999, 1998, 1995, 1991, 1988, 1994, 1996, 1993, 1900],
      dtype=int64)

In [94]:
df_overview_A['makeMonth'].unique()

array(['Apr', 'Aug', 'May', 'Sep', 'Mar', 'Jan', 'Dec', 'Jul', 'Feb',
       'Oct', 'Nov', 'Jun'], dtype=object)

* No inconsistencies in the format of make year or month; note that makeYear includes 1900, which looks like a placeholder value
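A quick range check can flag implausible years such as the 1900 seen in the output above. The bounds below are assumptions chosen for illustration: `MIN_YEAR` is arbitrary, and `MAX_YEAR` is simply the latest year in the listed values.

```python
import pandas as pd

MIN_YEAR = 1950  # assumed lower bound, adjust to the dataset
MAX_YEAR = 2023  # latest year in the unique values shown above

# Made-up sample of make years.
years = pd.Series([2022, 1988, 1900, 2023])

# between() is inclusive on both ends by default.
suspect = ~years.between(MIN_YEAR, MAX_YEAR)
print(years[suspect].tolist())
```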

##### `noOfOwners`

In [95]:
df_overview_A['noOfOwners'].unique()

array(['First', 'Second', '4 or More', 'Third', 'UnRegistered Car',
       'Fourth'], dtype=object)

In [96]:
df_overview_A['noOfOwners'] = df_overview_A['noOfOwners'].replace('UnRegistered Car','Unregistered')

In [97]:
df_overview_A['noOfOwners'].unique()

array(['First', 'Second', '4 or More', 'Third', 'Unregistered', 'Fourth'],
      dtype=object)

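* corrected one value ('UnRegistered Car' to 'Unregistered')
* no of owners data is now consistently labelled; 'Fourth' and '4 or More' remain as separate values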
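If a numeric owner count is useful for later analysis, the labels could be mapped to integers. This is a sketch under stated assumptions: treating 'Unregistered' as 0 and both 'Fourth' and '4 or More' as 4 are choices made here for illustration, not decisions taken in the notebook.

```python
import pandas as pd

# Assumed mapping; 'Fourth' and '4 or More' are both collapsed to 4.
OWNER_RANK = {
    "Unregistered": 0,
    "First": 1,
    "Second": 2,
    "Third": 3,
    "Fourth": 4,
    "4 or More": 4,
}

owners = pd.Series(["First", "Second", "4 or More", "Third", "Unregistered", "Fourth"])
owner_rank = owners.map(OWNER_RANK)  # labels missing from the dict would become NaN
print(owner_rank.tolist())
```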

##### `registerCity`

In [98]:
df_overview_A['registerCity'].unique()

array(['Not Available', 'Agra', 'Mall road', 'kanpur', 'AHMEDABAD',
       'BAVLA', 'MODASA', 'VADODARA', 'BHARUCH', 'BAWLA',
       'AHMEDABAD EAST', 'ANAND', 'RAJKOT', 'JAMNAGAR', 'GIR SOMNATH',
       'GHANDHINAGER', 'Chhotaudapur', 'SURAT', 'BANASKANTHA',
       'AHEMDABAD', 'ahmedabad', 'Ahmedabad', 'Gandhinagar',
       'GJ 01 KE3559', 'GANDHINAGAR', 'Madgoan Goa', 'pune',
       'Aizawl, Mizoram', 'Jaipur', 'Mumbai', 'mumbai', 'panvel',
       'Kanpur', 'Lucknow', 'Allahabad', 'JAIPUR', 'Nagpur East',
       'Yavatmal', 'NAGPUR URBAN', 'YAVATMAL', 'yavatmal', 'Nagpur',
       'Nagpur Rular', 'NAGPUR', 'Chandigarh', 'Punjab', 'PB', 'Ludhiana',
       'sangrur', 'chandigarh', 'DD', 'Anand, Gujarat', 'Patna', 'Bihar',
       'Nashik', 'Satara', 'Aurangabad', 'Aurangabad Maharashtra ',
       'KALYAN', 'Navi Mumbai', 'BORIVALI', 'Borivali', 'Vasai', 'Kalyan',
       'Mumbai east', 'pen', 'BENGALURU', 'MYSURU', 'BELGAUM', 'BELLARY',
       'SHIVAMOGGA', 'TUMKUR', 'HOSPET', 'NELAMANGA

* register city is inconsistent and is not needed for analysis, hence removing the column

In [99]:
df_overview_A.drop(columns = ['registerCity'],inplace=True)

In [100]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profileId       59708 non-null  object 
 1   vehicle         59708 non-null  object 
 2   price           59708 non-null  object 
 3   kilometers      59708 non-null  float64
 4   fuelName        59708 non-null  object 
 5   insurance       59422 non-null  object 
 6   makeName        59708 non-null  object 
 7   versionName     59708 non-null  object 
 8   makeYear        59708 non-null  int64  
 9   makeMonth       59708 non-null  object 
 10  city            59708 non-null  object 
 11  noOfOwners      59708 non-null  object 
 12  carAvailbaleAt  59708 non-null  object 
 13  regType         25222 non-null  object 
 14  priceNumeric    59708 non-null  int64  
 15  modelName       59708 non-null  object 
 16  state           59708 non-null  object 
 17  transmission    59702 non-null 

##### `priceNumeric` and `price`

In [101]:
# checking inconsistencies in price and price Numeric 
print( "null - ", df_overview_A[df_overview_A['price'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['price'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['price']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['price']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['price']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['price']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['price']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['price']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [102]:
# checking inconsistencies in price and price Numeric 
print( "null - ", df_overview_A[df_overview_A['priceNumeric'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['priceNumeric'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['priceNumeric']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['priceNumeric']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['priceNumeric']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['priceNumeric']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['priceNumeric']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['priceNumeric']== '0'].shape[0],'\n'
     )

null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* The priceNumeric data is consistent and does not need any cleaning
* Removing the price (string column) and renaming priceNumeric to price
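Before dropping the string column, the two price columns could be cross-checked by parsing the text form. The sketch below assumes values such as '69 Lakh' (1 lakh = 100,000). Support for 'Crore' (10,000,000) is an assumption, since no such value appears in the visible rows, and `price_to_rupees` is a hypothetical helper. The string price appears to be rounded, so an exact match is not expected for every row.

```python
# Multipliers for Indian numbering units; 'crore' is included as an assumption.
UNIT = {"lakh": 100_000, "crore": 10_000_000}

def price_to_rupees(text):
    """Convert a string such as '4.5 Lakh' to an integer rupee amount."""
    amount, unit = text.strip().split()
    return round(float(amount) * UNIT[unit.lower()])

# Pairs taken from rows shown earlier in the notebook output.
for text, numeric in [("69 Lakh", 6900000), ("4.5 Lakh", 450000), ("8.3 Lakh", 830192)]:
    parsed = price_to_rupees(text)
    print(text, parsed, "difference:", numeric - parsed)
```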

In [103]:
df_overview_A.drop("price",axis=1,inplace = True)

In [104]:
df_overview_A.rename(columns = {'priceNumeric':'price'},inplace=True)

In [105]:
df_overview_A.head(2)

Unnamed: 0,profileId,vehicle,kilometers,fuelName,insurance,makeName,versionName,makeYear,makeMonth,city,noOfOwners,carAvailbaleAt,regType,price,modelName,state,transmission
0,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,6000.0,Diesel + Diesel,ThirdParty,Mercedes-Benz,300d 4MATIC LWB,2022,Apr,A&N Islands,First,A&N Islands,Individual,7500000,GLE,Andaman and Nicobar Islands,Automatic
1,S2730115,Maruti Suzuki Vitara Brezza VDi (O) [2016-2018],55000.0,Diesel + Diesel,Comprehensive,Maruti Suzuki,VDi (O) [2016-2018],2016,Aug,A&N Islands,First,A&N Islands,Individual,600000,Vitara Brezza,Andaman and Nicobar Islands,Manual


In [106]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profileId       59708 non-null  object 
 1   vehicle         59708 non-null  object 
 2   kilometers      59708 non-null  float64
 3   fuelName        59708 non-null  object 
 4   insurance       59422 non-null  object 
 5   makeName        59708 non-null  object 
 6   versionName     59708 non-null  object 
 7   makeYear        59708 non-null  int64  
 8   makeMonth       59708 non-null  object 
 9   city            59708 non-null  object 
 10  noOfOwners      59708 non-null  object 
 11  carAvailbaleAt  59708 non-null  object 
 12  regType         25222 non-null  object 
 13  price           59708 non-null  int64  
 14  modelName       59708 non-null  object 
 15  state           59708 non-null  object 
 16  transmission    59702 non-null  object 
dtypes: float64(1), int64(2), object

##### `regType`

In [107]:
# first clearing out the leading and trailing spaces

df_overview_A['regType'] = df_overview_A['regType'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_overview_A[df_overview_A['regType'].isnull()].shape[0],'\n',
       "nan - ",df_overview_A[df_overview_A['regType'].isna()].shape[0],'\n',
       "empty - ", df_overview_A[df_overview_A['regType']==''].shape[0],'\n',
       "space - ",df_overview_A[df_overview_A['regType']==' '].shape[0],'\n',
       "NA - ", df_overview_A[df_overview_A['regType']=='Not Available'].shape[0],'\n',
       "None - ", df_overview_A[df_overview_A['regType']== None].shape[0],'\n',
       "0 - ",  df_overview_A[df_overview_A['regType']== 0].shape[0],'\n',
       "0(str) - ",  df_overview_A[df_overview_A['regType']== '0'].shape[0],'\n'
     )

null -  34486 
 nan -  34486 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [108]:
df_overview_A['regType'].unique()

array(['Individual', 'Corporate', nan, 'Taxi'], dtype=object)

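* regType is missing for 34486 of the 59708 records (about 58%); the populated values ('Individual', 'Corporate', 'Taxi') are consistent, so the column is kept for analysis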
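
A small sketch of how the share of missing values and the distribution of the populated labels could be inspected together. The `reg_type` series below is toy data, not the real column.

```python
import numpy as np
import pandas as pd

# Toy values for illustration; the real column is df_overview_A['regType']
reg_type = pd.Series(["Individual", np.nan, "Corporate", np.nan, "Taxi", np.nan])

# Fraction of records with no registration type
missing_share = reg_type.isna().mean()

# dropna=False keeps the NaN bucket in the frequency table
counts = reg_type.value_counts(dropna=False)

print(f"missing share: {missing_share:.0%}")
print(counts)
```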

In [109]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 17 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profileId       59708 non-null  object 
 1   vehicle         59708 non-null  object 
 2   kilometers      59708 non-null  float64
 3   fuelName        59708 non-null  object 
 4   insurance       59422 non-null  object 
 5   makeName        59708 non-null  object 
 6   versionName     59708 non-null  object 
 7   makeYear        59708 non-null  int64  
 8   makeMonth       59708 non-null  object 
 9   city            59708 non-null  object 
 10  noOfOwners      59708 non-null  object 
 11  carAvailbaleAt  59708 non-null  object 
 12  regType         25222 non-null  object 
 13  price           59708 non-null  int64  
 14  modelName       59708 non-null  object 
 15  state           59708 non-null  object 
 16  transmission    59702 non-null  object 
dtypes: float64(1), int64(2), object

In [110]:
df_overview_A.drop(columns = ['carAvailbaleAt'], inplace = True)

In [111]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   profileId     59708 non-null  object 
 1   vehicle       59708 non-null  object 
 2   kilometers    59708 non-null  float64
 3   fuelName      59708 non-null  object 
 4   insurance     59422 non-null  object 
 5   makeName      59708 non-null  object 
 6   versionName   59708 non-null  object 
 7   makeYear      59708 non-null  int64  
 8   makeMonth     59708 non-null  object 
 9   city          59708 non-null  object 
 10  noOfOwners    59708 non-null  object 
 11  regType       25222 non-null  object 
 12  price         59708 non-null  int64  
 13  modelName     59708 non-null  object 
 14  state         59708 non-null  object 
 15  transmission  59702 non-null  object 
dtypes: float64(1), int64(2), object(13)
memory usage: 7.3+ MB


##### `insurance`

In [112]:
df_overview_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59708 entries, 0 to 59707
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   profileId     59708 non-null  object 
 1   vehicle       59708 non-null  object 
 2   kilometers    59708 non-null  float64
 3   fuelName      59708 non-null  object 
 4   insurance     59422 non-null  object 
 5   makeName      59708 non-null  object 
 6   versionName   59708 non-null  object 
 7   makeYear      59708 non-null  int64  
 8   makeMonth     59708 non-null  object 
 9   city          59708 non-null  object 
 10  noOfOwners    59708 non-null  object 
 11  regType       25222 non-null  object 
 12  price         59708 non-null  int64  
 13  modelName     59708 non-null  object 
 14  state         59708 non-null  object 
 15  transmission  59702 non-null  object 
dtypes: float64(1), int64(2), object(13)
memory usage: 7.3+ MB


In [113]:
df_overview_A['insurance'].unique()

array(['ThirdParty', 'Comprehensive', 'Expired', 'Not Available',
       'Third Party', nan, 'No Insurance'], dtype=object)

* values are consistent apart from the 'ThirdParty' / 'Third Party' spelling variants, but there are a lot of unavailable and empty values.
* keeping this column for analysis
* correcting 'ThirdParty' to 'Third Party'

In [114]:
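# assigning the result back instead of using inplace=True on a column selection,
# which may not modify the DataFrame in newer pandas versions
df_overview_A['insurance'] = df_overview_A['insurance'].replace('ThirdParty', 'Third Party')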

In [115]:
df_overview_A['insurance'].unique()

array(['Third Party', 'Comprehensive', 'Expired', 'Not Available', nan,
       'No Insurance'], dtype=object)

In [116]:
df_overview_A[df_overview_A['insurance'].isnull()]

Unnamed: 0,profileId,vehicle,kilometers,fuelName,insurance,makeName,versionName,makeYear,makeMonth,city,noOfOwners,regType,price,modelName,state,transmission
4816,D4149539,Tata Tiago Revotron XZ,65000.0,Petrol,,Tata,Revotron XZ,2018,Jun,Haldwani,First,Individual,425000,Tiago,Uttarakhand,Manual
4819,D4181735,Maruti Suzuki Swift VXi,74947.0,Petrol,,Maruti Suzuki,VXi,2017,Jun,Haldwani,First,Individual,490000,Swift,Uttarakhand,Manual
4820,D4149591,Toyota Urban Cruiser High Grade MT,17000.0,Petrol,,Toyota,High Grade MT,2021,Jun,Haldwani,First,Individual,975000,Urban Cruiser,Uttarakhand,Manual
22561,D4177823,Maserati Levante Diesel,50000.0,Diesel,,Maserati,Diesel,2017,Jun,Thiruvananthapuram,Second,Individual,7700000,Levante,Kerala,Automatic
22600,S2732787,Hyundai Grand i10 Nios Sportz 1.2 Kappa VTVT,3300.0,Petrol + Petrol,,Hyundai,Sportz 1.2 Kappa VTVT,2022,Jun,Thiruvananthapuram,First,Individual,780000,Grand i10 NIOS,Kerala,Manual
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45413,S2741351,Maruti Suzuki Swift ZXi Plus Dual Tone [2021-2...,5000.0,Petrol + CNG,,Maruti Suzuki,ZXi Plus Dual Tone [2021-2023],2023,Jan,Bareilly,First,,951000,Swift,Uttar Pradesh,Manual
45414,S2758821,Ford Aspire Titanium1.5 TDCi,65000.0,Diesel,,Ford,Titanium1.5 TDCi,2015,Oct,Bareilly,First,,450000,Aspire,Uttar Pradesh,Manual
45415,S2771403,Mahindra Scorpio S4,175000.0,Diesel,,Mahindra,S4,2014,May,Bareilly,First,,700000,Scorpio,Uttar Pradesh,Manual
45416,S2784313,Mitsubishi Montero 3.2 Di-D AT,142000.0,Diesel,,Mitsubishi,3.2 Di-D AT,2012,Mar,Bareilly,First,,2400000,Montero,Uttar Pradesh,Automatic


* updating NaN values with 'Not Available'

In [117]:
# fillna with assignment is the more direct way to replace NaN values
df_overview_A['insurance'] = df_overview_A['insurance'].fillna('Not Available')

In [118]:
df_overview_A['insurance'].unique()

array(['Third Party', 'Comprehensive', 'Expired', 'Not Available',
       'No Insurance'], dtype=object)
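
The two clean-up steps on `insurance` (merging the spelling variants and filling NaN) can also be sketched as one chained expression. The `canonical` mapping and the toy `insurance` series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy values mirroring the labels seen in the unique() output
insurance = pd.Series(["ThirdParty", "Third Party", "Comprehensive", np.nan, "Expired"])

# Hypothetical mapping of variant spellings to one canonical label
canonical = {"ThirdParty": "Third Party"}

# replace() maps the variants, fillna() labels the missing values
cleaned = insurance.replace(canonical).fillna("Not Available")
print(cleaned.value_counts())
```

Keeping the variants in a dictionary makes it easy to add further spellings if they turn up in other datasets.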

In [119]:
df_overview_A.shape

(59708, 16)

#### EXPORTING OVERVIEW DATA

In [120]:
df_overview_A.to_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\overview_A.csv',index=False)

### `Specifications` DATA

In [121]:
#loading the specifications data
df_specs_A = pd.read_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-I\specifications_A.csv',low_memory=False)

In [122]:
df_specs_A.head(2)

Unnamed: 0,specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city,state
0,Top Speed,225.0,Kmph,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar
1,Acceleration (0-100 kmph),7.2,seconds,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar


In [123]:
df_specs_A.shape

(3191958, 9)

In [124]:
df_specs_A.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3191958 entries, 0 to 3191957
Data columns (total 9 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   specName       object
 1   specValue      object
 2   specUnit       object
 3   spec_category  object
 4   carId          int64 
 5   profileId      object
 6   vehicle        object
 7   city           object
 8   state          object
dtypes: int64(1), object(8)
memory usage: 219.2+ MB


#### REMOVING DUPLICATES

* profileId and specName together will be the unique key to identify each record, so duplicates are checked on that pair.

In [125]:
# taking a copy of the original dataframe before removing duplicates
df_specs_A_1 = df_specs_A.copy()

In [126]:
df_specs_A.nunique()

specName            46
specValue         7286
specUnit            13
spec_category        4
carId            15880
profileId        58908
vehicle           5349
city              1060
state               35
dtype: int64

In [127]:
df_specs_A['spec_category'].unique()

array(['Engine & Transmission', 'Dimensions & Weight', 'Capacity',
       'Suspensions, Brakes, Steering & Tyres'], dtype=object)

* profileId and specName will be the unique key to identify each record.
* for each car (profileId) there should be 46 specifications which fall under the below 4 main specification categories:
    * 'Engine & Transmission'
    * 'Dimensions & Weight'
    * 'Capacity'
    * 'Suspensions, Brakes, Steering & Tyres'
* estimating **46 * 58908** (unique cars) = 2709768 records
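
The estimate above can be reproduced and compared with the number of key pairs actually present. This is a sketch on a toy long-format table; the column names match the notebook, the data does not.

```python
import pandas as pd

# Toy long-format table: car A has 3 specifications, car B only 2
specs = pd.DataFrame({
    "profileId": ["A", "A", "A", "B", "B"],
    "specName": ["Length", "Width", "Height", "Length", "Width"],
})

n_specs = specs["specName"].nunique()
n_cars = specs["profileId"].nunique()

# Records expected if every car had every specification
expected = n_specs * n_cars

# Distinct (profileId, specName) pairs actually present
actual = len(specs.drop_duplicates(["profileId", "specName"]))

# Cars that are missing at least one specification
specs_per_car = specs.groupby("profileId")["specName"].nunique()
incomplete = specs_per_car[specs_per_car < n_specs]

print(expected, actual)
print(incomplete)
```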

In [128]:
df_specs_A.value_counts(['profileId','specName'])

profileId  specName          
D4111563   Rear Suspension       40
           Engine Type           40
           Height                40
           Fuel Type             40
           Fuel Tank Capacity    40
                                 ..
S2673607   Fuel Type              1
           Fuel Tank Capacity     1
           Front Tyres            1
           Front Suspension       1
S2819375   Width                  1
Name: count, Length: 1697127, dtype: int64

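* Only 1697127 unique (profileId, specName) combinations are available, fewer than the estimated 2709768, which means that some cars do not have all the specifications. The raw table has 3191958 rows, so the remaining rows are duplicates of these combinations.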
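
Before dropping duplicates on the key, it may be worth confirming that the repeated rows carry the same value, because `drop_duplicates` keeps only the first occurrence. A minimal sketch on toy data (the `specs` frame here is illustrative only):

```python
import pandas as pd

# Toy table with one repeated (profileId, specName) pair
specs = pd.DataFrame({
    "profileId": ["A", "A", "A", "B"],
    "specName": ["Length", "Length", "Width", "Length"],
    "specValue": ["4000", "4000", "1700", "3900"],
})
key = ["profileId", "specName"]

# Rows that repeat an earlier key
n_dupes = int(specs.duplicated(subset=key).sum())

# Keys whose repeated rows disagree on the value
values_per_key = specs.groupby(key)["specValue"].nunique()
conflicting = values_per_key[values_per_key > 1]

# Keep the first row for each key
deduped = specs.drop_duplicates(subset=key)

print(n_dupes, len(conflicting), len(deduped))
```

If `conflicting` is empty, dropping the repeats loses no information.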

In [129]:
# checking for a specific profileID
df_specs_A[(df_specs_A['profileId']=='D4111563')].head()

Unnamed: 0,specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city,state
286148,Engine,"2987 cc, 6 Cylinders In V Shape, 4 Valves/Cyli...",,Engine & Transmission,10456,D4111563,Mercedes-Benz GLS 350 d,Bagalkot,Karnataka
286149,Engine Type,V/6 9G-TRONIC,,Engine & Transmission,10456,D4111563,Mercedes-Benz GLS 350 d,Bagalkot,Karnataka
286150,Fuel Type,Diesel,,Engine & Transmission,10456,D4111563,Mercedes-Benz GLS 350 d,Bagalkot,Karnataka
286151,Max Power (bhp@rpm),255 bhp @ 3400 rpm,,Engine & Transmission,10456,D4111563,Mercedes-Benz GLS 350 d,Bagalkot,Karnataka
286152,Max Torque (Nm@rpm),620 Nm @ 1600 rpm,,Engine & Transmission,10456,D4111563,Mercedes-Benz GLS 350 d,Bagalkot,Karnataka


In [130]:
#removing duplicate records
df_specs_A.drop_duplicates(['profileId','specName'],inplace=True)

In [131]:
#value count also reveals no duplicates
df_specs_A.value_counts(['profileId','specName'])

profileId  specName                   
D1820959   Doors                          1
S2728075   Wheelbase                      1
S2728079   Fuel Type                      1
           Fuel Tank Capacity             1
           Front Tyres                    1
                                         ..
D4163003   Wheels                         1
           Wheelbase                      1
           Turbocharger / Supercharger    1
           Transmission                   1
S2819375   Width                          1
Name: count, Length: 1697127, dtype: int64

In [132]:
# checking if there are any duplicates
df_specs_A[(df_specs_A['profileId'] =='D4111563') & (df_specs_A['specName'] =="Fuel Type ")].head()

Unnamed: 0,specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city,state
286150,Fuel Type,Diesel,,Engine & Transmission,10456,D4111563,Mercedes-Benz GLS 350 d,Bagalkot,Karnataka


#### COLUMN-WISE UNDERSTANDING OF DATA AND CLEANING


In [133]:
# taking a clone for contingency
df_specs_A_2 = df_specs_A.copy()

In [134]:
#looking at the data
df_specs_A.head()

Unnamed: 0,specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city,state
0,Top Speed,225,Kmph,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar
1,Acceleration (0-100 kmph),7.2,seconds,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar
2,Engine,"1950 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar
3,Engine Type,OM654 Turbocharged I4,,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar
4,Fuel Type,Diesel,,Engine & Transmission,1,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB,A&N Islands,Andaman Nicobar


In [135]:
# unique values in each column
df_specs_A.nunique()

specName            46
specValue         7286
specUnit            13
spec_category        4
carId            15076
profileId        58908
vehicle           5204
city              1028
state               35
dtype: int64

In [136]:
df_specs_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1697127 entries, 0 to 3191957
Data columns (total 9 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   specName       object
 1   specValue      object
 2   specUnit       object
 3   spec_category  object
 4   carId          int64 
 5   profileId      object
 6   vehicle        object
 7   city           object
 8   state          object
dtypes: int64(1), object(8)
memory usage: 129.5+ MB


In [137]:
# deleting carId column as it is not needed

df_specs_A.drop(columns = ['carId'], inplace=True)

In [138]:
df_specs_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1697127 entries, 0 to 3191957
Data columns (total 8 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   specName       object
 1   specValue      object
 2   specUnit       object
 3   spec_category  object
 4   profileId      object
 5   vehicle        object
 6   city           object
 7   state          object
dtypes: object(8)
memory usage: 116.5+ MB


- analysing and cleaning each column

##### `specName`

In [139]:
# first clearing out the leading and trailing spaces

df_specs_A['specName'] = df_specs_A['specName'].str.strip()


# checking inconsistencies and null values

print( "null - ", df_specs_A[df_specs_A['specName'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A[df_specs_A['specName'].isna()].shape[0],'\n',
       "empty - ", df_specs_A[df_specs_A['specName']==''].shape[0],'\n',
       "space - ",df_specs_A[df_specs_A['specName']==' '].shape[0],'\n',
       "NA - ", df_specs_A[df_specs_A['specName']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A[df_specs_A['specName']== None].shape[0],'\n',
       "0 - ",  df_specs_A[df_specs_A['specName']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A[df_specs_A['specName']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [140]:
# unique values in specification Name
df_specs_A['specName'].unique()

array(['Top Speed', 'Acceleration (0-100 kmph)', 'Engine', 'Engine Type',
       'Fuel Type', 'Max Power (bhp@rpm)', 'Max Torque (Nm@rpm)',
       'Mileage (ARAI)', 'Driving Range', 'Drivetrain', 'Transmission',
       'Emission Standard', 'Turbocharger / Supercharger', 'Others',
       'Alternate Fuel', 'Length', 'Width', 'Height', 'Wheelbase',
       'Ground Clearance', 'Doors', 'Seating Capacity',
       'No of Seating Rows', 'Bootspace', 'Fuel Tank Capacity',
       'Four Wheel Steering', 'Front Suspension', 'Rear Suspension',
       'Front Brake Type', 'Rear Brake Type', 'Minimum Turning Radius',
       'Steering Type', 'Wheels', 'Spare Wheel', 'Front Tyres',
       'Rear Tyres', 'Kerb Weight', 'Battery',
       'City Mileage (CarWale Tested)',
       'Highway Mileage (CarWale Tested)', 'Max Motor Performance',
       'Electric Motor', 'Performance on Alternate Fuel',
       'Battery Charging', 'Electric Motor Assist',
       'Range (Carwale Tested)'], dtype=object)

* no discrepancies or inconsistencies

##### `specValue`

* this contains heterogeneous data
* checking for None, NaN or empty values in the specification values

In [141]:
# first clearing out the leading and trailing spaces

df_specs_A['specValue'] = df_specs_A['specValue'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A[df_specs_A['specValue'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A[df_specs_A['specValue'].isna()].shape[0],'\n',
       "empty - ", df_specs_A[df_specs_A['specValue']==''].shape[0],'\n',
       "space - ",df_specs_A[df_specs_A['specValue']==' '].shape[0],'\n',
       "NA - ", df_specs_A[df_specs_A['specValue']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A[df_specs_A['specValue']== None].shape[0],'\n',
       "0 - ",  df_specs_A[df_specs_A['specValue']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A[df_specs_A['specValue']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  13932 



* specValue contains 13932 records with the string value '0'.
* These values can only be analysed by each specification, which will be done after this part.
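
A sketch of how the '0' values could be broken down by specification, to see which specifications they come from. The data below is a toy example.

```python
import pandas as pd

# Toy long-format table for illustration
specs = pd.DataFrame({
    "specName": ["Four Wheel Steering", "Four Wheel Steering", "Length", "Bootspace"],
    "specValue": ["0", "0", "4924", "0"],
})

# Number of '0' values per specification
zero_counts = specs.loc[specs["specValue"] == "0", "specName"].value_counts()
print(zero_counts)
```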

##### `specUnit`

* checking for null or empty values in specification Unit
* the units are needed later when renaming column labels

In [142]:

# first clearing out the leading and trailing spaces

df_specs_A['specUnit'] = df_specs_A['specUnit'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A[df_specs_A['specUnit'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A[df_specs_A['specUnit'].isna()].shape[0],'\n',
       "empty - ", df_specs_A[df_specs_A['specUnit']==''].shape[0],'\n',
       "space - ",df_specs_A[df_specs_A['specUnit']==' '].shape[0],'\n',
       "NA - ", df_specs_A[df_specs_A['specUnit']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A[df_specs_A['specUnit']== None].shape[0],'\n',
       "0 - ",  df_specs_A[df_specs_A['specUnit']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A[df_specs_A['specUnit']== '0'].shape[0],'\n'
     )


null -  997157 
 nan -  997157 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* a lot of null and NaN values can be seen (997157 of 1697127 records)
* this could be because some specifications have no unit, or because the unit is not available
* units are not required for every record as we will be appending units to the column names
* so the missing values in this column are ignored
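
A sketch of how one unit per specification could be collected and turned into column labels. The label format `"<specName> (<unit>)"` is an assumption, not something fixed by the notebook, and the data is a toy example.

```python
import numpy as np
import pandas as pd

# Toy long-format table for illustration
specs = pd.DataFrame({
    "specName": ["Length", "Length", "Fuel Type", "Top Speed"],
    "specUnit": ["mm", "mm", np.nan, "Kmph"],
})

# One unit per specification; specifications without a unit are left out
unit_by_spec = (
    specs.dropna(subset=["specUnit"])
    .drop_duplicates("specName")
    .set_index("specName")["specUnit"]
)

# Hypothetical label format "<specName> (<unit>)"
labels = {name: f"{name} ({unit})" for name, unit in unit_by_spec.items()}
print(labels)
```

A dictionary like `labels` could later be passed to `DataFrame.rename(columns=labels)` once the table is in wide form.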

In [143]:
# checking unique units
df_specs_A['specUnit'].unique()

array(['Kmph', 'seconds', nan, 'kmpl', 'Km', 'mm', 'Doors', 'Person',
       'Rows', 'litres', 'metres', 'kg', 'km/kg', 'km/full charge'],
      dtype=object)

##### `profileId`

* checking for null or empty values in profileId which is a unique identifier of each car

In [144]:
# first clearing out the leading and trailing spaces

df_specs_A['profileId'] = df_specs_A['profileId'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A[df_specs_A['profileId'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A[df_specs_A['profileId'].isna()].shape[0],'\n',
       "empty - ", df_specs_A[df_specs_A['profileId']==''].shape[0],'\n',
       "space - ",df_specs_A[df_specs_A['profileId']==' '].shape[0],'\n',
       "NA - ", df_specs_A[df_specs_A['profileId']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A[df_specs_A['profileId']== None].shape[0],'\n',
       "0 - ",  df_specs_A[df_specs_A['profileId']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A[df_specs_A['profileId']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [145]:
set(pd.Series(df_specs_A['profileId'].unique()).str.slice(stop = 2))

{'D1', 'D2', 'D3', 'D4', 'S2'}

* no inconsistencies in profileId data as well

##### `vehicle`

- checking for null or empty values in vehicle name

In [146]:
# first clearing out the leading and trailing spaces

df_specs_A['vehicle'] = df_specs_A['vehicle'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A[df_specs_A['vehicle'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A[df_specs_A['vehicle'].isna()].shape[0],'\n',
       "empty - ", df_specs_A[df_specs_A['vehicle']==''].shape[0],'\n',
       "space - ",df_specs_A[df_specs_A['vehicle']==' '].shape[0],'\n',
       "NA - ", df_specs_A[df_specs_A['vehicle']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A[df_specs_A['vehicle']== None].shape[0],'\n',
       "0 - ",  df_specs_A[df_specs_A['vehicle']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A[df_specs_A['vehicle']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* no inconsistencies
* keeping this column for verification purposes

##### `city` and `state`

- the city and state columns are not needed, hence deleting them

In [147]:
df_specs_A.drop(columns = ['city','state'],inplace = True)

In [148]:
df_specs_A.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1697127 entries, 0 to 3191957
Data columns (total 6 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   specName       object
 1   specValue      object
 2   specUnit       object
 3   spec_category  object
 4   profileId      object
 5   vehicle        object
dtypes: object(6)
memory usage: 90.6+ MB


#### LOOKING INTO SPECIFICATION CATEGORIES

* The dataset will be split by CATEGORIES for ease of analysis
* There are 4 categories and data will be split according to this:
    * 'Engine & Transmission'
    * 'Dimensions & Weight'
    * 'Capacity'
    * 'Suspensions, Brakes, Steering & Tyres'
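
The split by category can be sketched with a `groupby` that yields one DataFrame per category. The `frames` dictionary is a hypothetical name, and the table below is toy data.

```python
import pandas as pd

# Toy long-format table for illustration
specs = pd.DataFrame({
    "spec_category": ["Capacity", "Capacity", "Dimensions & Weight"],
    "specName": ["Doors", "Bootspace", "Length"],
    "specValue": ["5", "630", "4924"],
})

# One independent copy of the rows for each category
frames = {category: group.copy() for category, group in specs.groupby("spec_category")}

for category, frame in frames.items():
    print(category, len(frame))
```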

In [149]:
df_specs_A[(df_specs_A['spec_category']=='Dimensions & Weight') & (df_specs_A['profileId']=='S2725575' ) ]

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
15,Length,4924,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
16,Width,2157,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
17,Height,1772,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
18,Wheelbase,2995,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
19,Ground Clearance,215,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


In [150]:
df_specs_A[(df_specs_A['spec_category']=='Capacity') & (df_specs_A['profileId']=='S2725575' ) ]

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
20,Doors,5,Doors,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
21,Seating Capacity,5,Person,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
22,No of Seating Rows,2,Rows,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
23,Bootspace,630,litres,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
24,Fuel Tank Capacity,93,litres,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


In [151]:
df_specs_A[(df_specs_A['spec_category']=='Suspensions, Brakes, Steering & Tyres') & (df_specs_A['profileId']=='S2725575' ) ]

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
25,Four Wheel Steering,0,,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
26,Front Suspension,"Independent, Double Wishbone, Coil Springs",,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
27,Rear Suspension,"Independent, Multi-link, Coil Springs",,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
28,Front Brake Type,Ventilated Disc,,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
29,Rear Brake Type,Ventilated Disc,,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
30,Minimum Turning Radius,5.9,metres,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
31,Steering Type,Power assisted (Electric),,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
32,Wheels,Alloy Wheels,,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
33,Spare Wheel,Space Saver,,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
34,Front Tyres,265 / 45 R19,,"Suspensions, Brakes, Steering & Tyres",S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


- The categories that might be useful for our analysis are:
    * 'Engine & Transmission'
    * 'Dimensions & Weight'
    * 'Capacity'
- Ignoring the 'Suspensions, Brakes, Steering & Tyres' category

#### `Engine & Transmission`

In [152]:
#looking at the data
df_specs_A[df_specs_A['spec_category'] == 'Engine & Transmission'].head()

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
0,Top Speed,225,Kmph,Engine & Transmission,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
1,Acceleration (0-100 kmph),7.2,seconds,Engine & Transmission,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
2,Engine,"1950 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",,Engine & Transmission,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
3,Engine Type,OM654 Turbocharged I4,,Engine & Transmission,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
4,Fuel Type,Diesel,,Engine & Transmission,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


In [153]:
# copying Engine category into a new variable
df_specs_A_Engine = df_specs_A[df_specs_A['spec_category'] == 'Engine & Transmission'].copy()

In [154]:
pivot_columns = df_specs_A_Engine['specName'].unique()

##### PIVOTING TABLE

In [155]:
df_specs_A_Engine = df_specs_A_Engine.pivot(index =['profileId','vehicle'], columns = 'specName', values ='specValue')

In [156]:
df_specs_A_Engine.head()

Unnamed: 0_level_0,specName,Acceleration (0-100 kmph),Alternate Fuel,Battery,Battery Charging,City Mileage (CarWale Tested),Drivetrain,Driving Range,Electric Motor,Electric Motor Assist,Emission Standard,...,Max Motor Performance,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Others,Performance on Alternate Fuel,Range (Carwale Tested),Top Speed,Transmission,Turbocharger / Supercharger
profileId,vehicle,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,,,,,FWD,,,,,...,,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,,,,,Automatic - 4 Gears,
D1982769,Audi A8 L 50 TDI,,,,,,4WD / AWD,,,,,...,,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,,,,,"Automatic - 8 Gears, Paddle Shift, Sport Mode",Turbocharged
D2136401,Honda Amaze 1.5 EX i-DTEC,,,,,,FWD,,,,,...,,99 bhp @ 3600 rpm,200 Nm @ 1750 rpm,25.8,,,,,Manual - 5 Gears,Turbocharged
D2184679,Honda City S,,,,,,FWD,,,,,...,,117 bhp @ 6600 rpm,145 Nm @ 4600 rpm,17.8,,,,,Manual - 5 Gears,No
D2184744,Hyundai Elite i20 Asta 1.4 (O) CRDi,,,,,,FWD,,,,,...,,89 bhp @ 4000 rpm,220 Nm @ 1500 rpm,22.5,,,,,Manual - 6 Gears,Turbocharged


In [157]:
# resetting index
df_specs_A_Engine.reset_index(inplace = True)

In [158]:
df_specs_A_Engine.index

RangeIndex(start=0, stop=58635, step=1)

In [159]:
# renaming the axis name after resetting
df_specs_A_Engine.rename_axis('',axis = 1,inplace=True)
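
The pivot, index reset and axis rename steps can be seen end to end on a small example. This is a sketch on toy data; `long` and `wide` are illustrative names. `pivot` raises an error if an (index, columns) pair appears more than once, which is why the duplicates had to be removed first.

```python
import pandas as pd

# Toy long-format table: one row per (car, specification)
long = pd.DataFrame({
    "profileId": ["A", "A", "B"],
    "vehicle": ["Car A", "Car A", "Car B"],
    "specName": ["Fuel Type", "Transmission", "Fuel Type"],
    "specValue": ["Diesel", "Manual - 5 Gears", "Petrol"],
})

wide = (
    long.pivot(index=["profileId", "vehicle"], columns="specName", values="specValue")
    .reset_index()              # turn profileId and vehicle back into columns
    .rename_axis(None, axis=1)  # drop the leftover 'specName' label on the columns
)
print(wide)
```

Specifications that a car does not have become NaN in the wide table, which explains the many sparse columns after pivoting.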

In [160]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 26 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   profileId                         58635 non-null  object
 1   vehicle                           58635 non-null  object
 2   Acceleration (0-100 kmph)         3061 non-null   object
 3   Alternate Fuel                    15432 non-null  object
 4   Battery                           699 non-null    object
 5   Battery Charging                  188 non-null    object
 6   City Mileage (CarWale Tested)     176 non-null    object
 7   Drivetrain                        55020 non-null  object
 8   Driving Range                     11503 non-null  object
 9   Electric Motor                    452 non-null    object
 10  Electric Motor Assist             19 non-null     object
 11  Emission Standard                 22465 non-null  object
 12  Engine            

In [161]:
# unique count of values of each column
df_specs_A_Engine.nunique()


profileId                           58635
vehicle                              5165
Acceleration (0-100 kmph)             126
Alternate Fuel                          5
Battery                                38
Battery Charging                       14
City Mileage (CarWale Tested)           5
Drivetrain                              6
Driving Range                         348
Electric Motor                         12
Electric Motor Assist                   5
Emission Standard                       4
Engine                                441
Engine Type                           815
Fuel Type                               9
Highway Mileage (CarWale Tested)        5
Max Motor Performance                  32
Max Power (bhp@rpm)                   761
Max Torque (Nm@rpm)                   723
Mileage (ARAI)                        656
Others                                  5
Performance on Alternate Fuel          16
Range (Carwale Tested)                  5
Top Speed                        

In [162]:
df_specs_A_Engine[df_specs_A_Engine.notnull()]

Unnamed: 0,profileId,vehicle,Acceleration (0-100 kmph),Alternate Fuel,Battery,Battery Charging,City Mileage (CarWale Tested),Drivetrain,Driving Range,Electric Motor,...,Max Motor Performance,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Others,Performance on Alternate Fuel,Range (Carwale Tested),Top Speed,Transmission,Turbocharger / Supercharger
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,,,,,FWD,,,...,,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,,,,,Automatic - 4 Gears,
1,D1982769,Audi A8 L 50 TDI,,,,,,4WD / AWD,,,...,,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,,,,,"Automatic - 8 Gears, Paddle Shift, Sport Mode",Turbocharged
2,D2136401,Honda Amaze 1.5 EX i-DTEC,,,,,,FWD,,,...,,99 bhp @ 3600 rpm,200 Nm @ 1750 rpm,25.8,,,,,Manual - 5 Gears,Turbocharged
3,D2184679,Honda City S,,,,,,FWD,,,...,,117 bhp @ 6600 rpm,145 Nm @ 4600 rpm,17.8,,,,,Manual - 5 Gears,No
4,D2184744,Hyundai Elite i20 Asta 1.4 (O) CRDi,,,,,,FWD,,,...,,89 bhp @ 4000 rpm,220 Nm @ 1500 rpm,22.5,,,,,Manual - 6 Gears,Turbocharged
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58630,S2819197,Hyundai Xcent S 1.2 [2014-2016],,,,,,FWD,,,...,,81 bhp @ 6000 rpm,114 Nm @ 4000 rpm,19.1,,,,,Manual - 5 Gears,No
58631,S2819213,Mahindra Bolero Neo Limited Edition [2023],,Not Applicable,,,,RWD,865,,...,,100 bhp @ 3750 rpm,260 Nm @ 1750 rpm,17.2,Idle Start/Stop,,,,Manual - 5 Gears,Turbocharged
58632,S2819265,Hyundai Eon D-Lite +,,,,,,FWD,,,...,,55 bhp @ 5500 rpm,75 Nm @ 4000 rpm,21.1,,,,,Manual - 5 Gears,No
58633,S2819349,Honda City VX CVT,,,,,,FWD,,,...,,117 bhp @ 6600 rpm,145 Nm @ 4600 rpm,18,,,,,"Automatic - CVT Gears, Sport Mode",No


In [163]:
df_specs_A_Engine['Drivetrain'].unique()

array(['FWD', '4WD / AWD', 'AWD', nan, 'RWD', 'AWD with Terrain Mode',
       '4WD'], dtype=object)

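* Several columns are too sparsely populated to be useful (for example, `Electric Motor` has only 452 non-null values out of 58,635 rows), so they are removed from the analysis.
* The remaining columns are then processed one at a time.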
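One way to make the "not enough data" judgement explicit is to compute each column's non-null share and list the columns below a cutoff. This is a minimal sketch on a made-up frame. The helper name `sparse_columns` and the 0.5 cutoff are assumptions for illustration; the list dropped in the next cell was chosen by hand.

```python
import numpy as np
import pandas as pd

def sparse_columns(df, min_fill=0.5):
    """Return names of columns whose non-null share is below min_fill (hypothetical helper)."""
    fill_ratio = df.notna().mean()  # share of non-null values per column
    return fill_ratio[fill_ratio < min_fill].index.tolist()

# toy frame standing in for df_specs_A_Engine (values are made up)
demo = pd.DataFrame({
    'profileId': ['D1', 'D2', 'D3', 'D4'],
    'Top Speed': [np.nan, np.nan, np.nan, '180 kmph'],
    'Transmission': ['Manual - 5 Gears', 'Manual', 'AMT - 5 Gears', 'Manual'],
})

to_drop = sparse_columns(demo, min_fill=0.5)
print(to_drop)
```

A cutoff like this only flags candidates. Whether a sparse column is still worth keeping remains a judgement call.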

In [164]:
# removing unwanted columns
df_specs_A_Engine.drop(columns = ['Acceleration (0-100 kmph)','Battery','Battery Charging','Drivetrain','Driving Range',
                                    'Electric Motor','Electric Motor Assist','Emission Standard','Max Motor Performance','Others',
                                     'Performance on Alternate Fuel','Range (Carwale Tested)','Top Speed','Turbocharger / Supercharger'],inplace = True)

In [165]:
df_specs_A_Engine.head(2)

Unnamed: 0,profileId,vehicle,Alternate Fuel,City Mileage (CarWale Tested),Engine,Engine Type,Fuel Type,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,,"1197 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",4 cylinder inline petrol engine,Petrol,,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic - 4 Gears
1,D1982769,Audi A8 L 50 TDI,,,"2967 cc, 6 Cylinders In V Shape, 4 Valves/Cyli...",V6 diesel engine with common rail injection sy...,Diesel,,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,"Automatic - 8 Gears, Paddle Shift, Sport Mode"


In [166]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 12 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   profileId                         58635 non-null  object
 1   vehicle                           58635 non-null  object
 2   Alternate Fuel                    15432 non-null  object
 3   City Mileage (CarWale Tested)     176 non-null    object
 4   Engine                            58300 non-null  object
 5   Engine Type                       53517 non-null  object
 6   Fuel Type                         58581 non-null  object
 7   Highway Mileage (CarWale Tested)  176 non-null    object
 8   Max Power (bhp@rpm)               58071 non-null  object
 9   Max Torque (Nm@rpm)               58071 non-null  object
 10  Mileage (ARAI)                    54078 non-null  object
 11  Transmission                      58635 non-null  object
dtypes: object(12)
memo

##### `fuel`

In [167]:
# renaming the 'Fuel Type' column to 'fuel'
df_specs_A_Engine.rename({'Fuel Type':'fuel'},axis=1,inplace=True)
df_specs_A_Engine.head(2)

Unnamed: 0,profileId,vehicle,Alternate Fuel,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,,"1197 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",4 cylinder inline petrol engine,Petrol,,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic - 4 Gears
1,D1982769,Audi A8 L 50 TDI,,,"2967 cc, 6 Cylinders In V Shape, 4 Valves/Cyli...",V6 diesel engine with common rail injection sy...,Diesel,,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,"Automatic - 8 Gears, Paddle Shift, Sport Mode"


In [168]:
# first clearing out the leading and trailing spaces

df_specs_A_Engine['fuel'] = df_specs_A_Engine['fuel'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['fuel'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['fuel'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['fuel']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['fuel']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['fuel']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['fuel']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['fuel']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['fuel']== '0'].shape[0],'\n'
     )


null -  54 
 nan -  54 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* 54 null/NaN values were found
* LPG and CNG vehicles are listed under Alternate Fuel
* checking whether the missing fuel values can be filled from Alternate Fuel

In [169]:
df_specs_A_Engine[(df_specs_A_Engine['fuel'].isna()) & (~df_specs_A_Engine['Alternate Fuel'].isna())].shape

(54, 12)

In [170]:
df_specs_A_Engine[(df_specs_A_Engine['fuel'].isna()) & (~df_specs_A_Engine['Alternate Fuel'].isna())]['Alternate Fuel'].unique()

array(['LPG', 'CNG'], dtype=object)

In [171]:
df_specs_A_Engine[(df_specs_A_Engine['fuel'].isna()) & (~df_specs_A_Engine['Alternate Fuel'].isna())]

Unnamed: 0,profileId,vehicle,Alternate Fuel,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission
38,D2677249,Maruti Suzuki Wagon R 1.0 LXi LPG,LPG,,"998 cc, 3 Cylinders Inline, 4 Valves/Cylinder",K10B,,,46 bhp @ 6200 rpm,85 Nm @ 3500 rpm,13.1,Manual - 5 Gears
743,D3597535,Maruti Suzuki SX4 VXI CNG BS-IV,CNG,,"1586 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",16V DOHC VVT,,,87 bhp @ 5600 rpm,122 Nm @ 4100 rpm,21.4,Manual - 5 Gears
2145,D3882207,Maruti Suzuki Alto LXi CNG,CNG,,"796 cc, 3 Cylinders Inline, 4 Valves/Cylinder",FC engine,,,39 bhp @ 6200 rpm,54 Nm @ 3000 rpm,26.83,Manual - 5 Gears
4578,D4010011,Maruti Suzuki Alto LXi CNG,CNG,,"796 cc, 3 Cylinders Inline, 4 Valves/Cylinder",FC engine,,,39 bhp @ 6200 rpm,54 Nm @ 3000 rpm,26.83,Manual - 5 Gears
4720,D4014089,Maruti Suzuki SX4 VXI CNG BS-IV,CNG,,"1586 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",16V DOHC VVT,,,87 bhp @ 5600 rpm,122 Nm @ 4100 rpm,21.4,Manual - 5 Gears
4834,D4016849,Maruti Suzuki Wagon R 1.0 LXi LPG,LPG,,"998 cc, 3 Cylinders Inline, 4 Valves/Cylinder",K10B,,,46 bhp @ 6200 rpm,85 Nm @ 3500 rpm,13.1,Manual - 5 Gears
10534,D4109083,Maruti Suzuki Alto LXi CNG,CNG,,"796 cc, 3 Cylinders Inline, 4 Valves/Cylinder",FC engine,,,39 bhp @ 6200 rpm,54 Nm @ 3000 rpm,26.83,Manual - 5 Gears
10580,D4109567,Maruti Suzuki Alto LXi CNG,CNG,,"796 cc, 3 Cylinders Inline, 4 Valves/Cylinder",FC engine,,,39 bhp @ 6200 rpm,54 Nm @ 3000 rpm,26.83,Manual - 5 Gears
17568,D4153601,Maruti Suzuki SX4 VXI CNG BS-IV,CNG,,"1586 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",16V DOHC VVT,,,87 bhp @ 5600 rpm,122 Nm @ 4100 rpm,21.4,Manual - 5 Gears
23737,D4181115,Maruti Suzuki Alto LXi CNG,CNG,,"796 cc, 3 Cylinders Inline, 4 Valves/Cylinder",FC engine,,,39 bhp @ 6200 rpm,54 Nm @ 3000 rpm,26.83,Manual - 5 Gears


In [172]:
df_specs_A_Engine.head(1)

Unnamed: 0,profileId,vehicle,Alternate Fuel,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,,"1197 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",4 cylinder inline petrol engine,Petrol,,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic - 4 Gears


In [173]:
# function to fill a missing fuel value from the alternate fuel
def value_update(row):
    # columns are addressed by name so the logic does not depend on column order
    if pd.isna(row['fuel']) and pd.notna(row['Alternate Fuel']):
        row['fuel'] = row['Alternate Fuel']
    return row

In [174]:
df_specs_A_Engine = df_specs_A_Engine.apply(value_update,axis = 1)
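A row-wise `apply` is typically slower than a column-wise operation. The same fill can be sketched with `Series.fillna`, which accepts another Series and aligns on the index. The toy frame below is made up and only mirrors the two relevant columns.

```python
import numpy as np
import pandas as pd

# toy stand-in for the two relevant columns (values are made up)
demo = pd.DataFrame({
    'Alternate Fuel': ['LPG', np.nan, 'CNG', 'CNG'],
    'fuel': [np.nan, 'Petrol', np.nan, 'Petrol'],
})

# keep existing fuel values; use 'Alternate Fuel' only where fuel is missing
demo['fuel'] = demo['fuel'].fillna(demo['Alternate Fuel'])
print(demo['fuel'].tolist())
```

Rows that already have a fuel value are left untouched, which matches the intent of `value_update`.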

In [175]:
# checking if the update is successful
df_specs_A_Engine[(df_specs_A_Engine['fuel'].isna()) & (~df_specs_A_Engine['Alternate Fuel'].isna())]

Unnamed: 0,profileId,vehicle,Alternate Fuel,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission


In [176]:
df_specs_A_Engine[df_specs_A_Engine['profileId']=='D2677249']

Unnamed: 0,profileId,vehicle,Alternate Fuel,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission
38,D2677249,Maruti Suzuki Wagon R 1.0 LXi LPG,LPG,,"998 cc, 3 Cylinders Inline, 4 Valves/Cylinder",K10B,LPG,,46 bhp @ 6200 rpm,85 Nm @ 3500 rpm,13.1,Manual - 5 Gears


In [177]:
# checking out unique values within the fuel
df_specs_A_Engine.fuel.unique()

array(['Petrol', 'Diesel', 'LPG', 'CNG', 'Electric',
       'Hybrid (Electric + Petrol)', 'Mild Hybrid(Electric + Petrol)',
       'Mild Hybrid (Electric + Diesel)',
       'Plug-in Hybrid (Electric + Petrol)'], dtype=object)

* the usual fuel types are present: Petrol, Diesel, CNG and LPG, along with Electric
* there are several hybrid sub-categories (mild, plug-in); for analysis purposes they are merged into two categories, Hybrid (Electric + Petrol) and Hybrid (Electric + Diesel)
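The merge described above can also be expressed as a single mapping, which keeps all the relabelling rules in one place. This is a sketch on a made-up Series. The dictionary name `hybrid_map` is an assumption; the labels are taken from the unique values listed above.

```python
import pandas as pd

# sub-category -> consolidated hybrid label
hybrid_map = {
    'Mild Hybrid(Electric + Petrol)': 'Hybrid (Electric + Petrol)',
    'Plug-in Hybrid (Electric + Petrol)': 'Hybrid (Electric + Petrol)',
    'Mild Hybrid (Electric + Diesel)': 'Hybrid (Electric + Diesel)',
}

# toy stand-in for the fuel column
fuel = pd.Series(['Petrol', 'Mild Hybrid(Electric + Petrol)',
                  'Mild Hybrid (Electric + Diesel)',
                  'Plug-in Hybrid (Electric + Petrol)', 'Electric'])

# values not present in the mapping are left unchanged
fuel = fuel.replace(hybrid_map)
print(fuel.unique())
```

Assigning the result back, rather than using `inplace=True` on a column accessed from the dataframe, avoids chained in-place modification, which newer pandas versions may warn about.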

In [178]:
df_specs_A_Engine.fuel.replace('Mild Hybrid(Electric + Petrol)','Hybrid (Electric + Petrol)',inplace=True)

In [179]:
df_specs_A_Engine.fuel.replace('Mild Hybrid (Electric + Diesel)','Hybrid (Electric + Diesel)',inplace=True)

In [180]:
df_specs_A_Engine.fuel.replace('Plug-in Hybrid (Electric + Petrol)','Hybrid (Electric + Petrol)',inplace=True)

In [181]:
df_specs_A_Engine.fuel.unique()

array(['Petrol', 'Diesel', 'LPG', 'CNG', 'Electric',
       'Hybrid (Electric + Petrol)', 'Hybrid (Electric + Diesel)'],
      dtype=object)

In [182]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 12 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   profileId                         58635 non-null  object
 1   vehicle                           58635 non-null  object
 2   Alternate Fuel                    15432 non-null  object
 3   City Mileage (CarWale Tested)     176 non-null    object
 4   Engine                            58300 non-null  object
 5   Engine Type                       53517 non-null  object
 6   fuel                              58635 non-null  object
 7   Highway Mileage (CarWale Tested)  176 non-null    object
 8   Max Power (bhp@rpm)               58071 non-null  object
 9   Max Torque (Nm@rpm)               58071 non-null  object
 10  Mileage (ARAI)                    54078 non-null  object
 11  Transmission                      58635 non-null  object
dtypes: object(12)
memo

In [183]:
# dropping the Alternate Fuel column
df_specs_A_Engine.drop(columns=['Alternate Fuel'],inplace = True)

In [184]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 11 columns):
 #   Column                            Non-Null Count  Dtype 
---  ------                            --------------  ----- 
 0   profileId                         58635 non-null  object
 1   vehicle                           58635 non-null  object
 2   City Mileage (CarWale Tested)     176 non-null    object
 3   Engine                            58300 non-null  object
 4   Engine Type                       53517 non-null  object
 5   fuel                              58635 non-null  object
 6   Highway Mileage (CarWale Tested)  176 non-null    object
 7   Max Power (bhp@rpm)               58071 non-null  object
 8   Max Torque (Nm@rpm)               58071 non-null  object
 9   Mileage (ARAI)                    54078 non-null  object
 10  Transmission                      58635 non-null  object
dtypes: object(11)
memory usage: 4.9+ MB


##### `Mileage (ARAI)` ,`City Mileage (CarWale Tested)` and `Highway Mileage (CarWale Tested)`

In [185]:

df_specs_A_Engine['Mileage (ARAI)'] = df_specs_A_Engine['Mileage (ARAI)'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['Mileage (ARAI)']== '0'].shape[0],'\n'
     )


null -  4557 
 nan -  4557 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* a large number of null/NaN values (4,557) found for Mileage (ARAI)
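The same set of checks is repeated for every column in this section, so it could be wrapped in a small helper. The sketch below is an assumption about how that might look. `missing_report` is a hypothetical name and the demo values are made up. Since `isnull` is an alias of `isna`, a single null count is enough.

```python
import numpy as np
import pandas as pd

def missing_report(series):
    """Count common 'missing value' spellings in a text column (hypothetical helper)."""
    s = series.str.strip()  # whitespace-only entries become ''
    return {
        'null': int(s.isna().sum()),
        'empty': int((s == '').sum()),
        'Not Available': int((s == 'Not Available').sum()),
        '0(str)': int((s == '0').sum()),
    }

# toy column with one example of each case
demo = pd.Series(['16.95', ' ', np.nan, 'Not Available', '0', '17.8'])
report = missing_report(demo)
print(report)
```

Calling such a helper once per column would replace each of the long `print` blocks with a single line.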

In [186]:
df_specs_A_Engine['City Mileage (CarWale Tested)'] = df_specs_A_Engine['City Mileage (CarWale Tested)'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['City Mileage (CarWale Tested)']== '0'].shape[0],'\n'
     )


null -  58459 
 nan -  58459 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [187]:
df_specs_A_Engine['Highway Mileage (CarWale Tested)'] = df_specs_A_Engine['Highway Mileage (CarWale Tested)'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['Highway Mileage (CarWale Tested)']== '0'].shape[0],'\n'
     )

null -  58459 
 nan -  58459 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [188]:
# checking for rows where City Mileage (CarWale Tested) has a value but Highway Mileage (CarWale Tested) is null

df_specs_A_Engine[(~df_specs_A_Engine['City Mileage (CarWale Tested)'].isna()) & (df_specs_A_Engine['Highway Mileage (CarWale Tested)'].isna())]

Unnamed: 0,profileId,vehicle,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission


In [189]:
# checking for rows where Highway Mileage (CarWale Tested) has a value but City Mileage (CarWale Tested) is null

df_specs_A_Engine[(df_specs_A_Engine['City Mileage (CarWale Tested)'].isna()) & (~df_specs_A_Engine['Highway Mileage (CarWale Tested)'].isna())]

Unnamed: 0,profileId,vehicle,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission


* checking if Mileage NaN values could be replaced with City Mileage or Highway Mileage

In [190]:
df_specs_A_Engine[(df_specs_A_Engine['Mileage (ARAI)'].isna()) & (~df_specs_A_Engine['City Mileage (CarWale Tested)'].isna())]

Unnamed: 0,profileId,vehicle,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission


In [191]:
df_specs_A_Engine[(df_specs_A_Engine['Mileage (ARAI)'].isna()) & (~df_specs_A_Engine['Highway Mileage (CarWale Tested)'].isna())]

Unnamed: 0,profileId,vehicle,City Mileage (CarWale Tested),Engine,Engine Type,fuel,Highway Mileage (CarWale Tested),Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),Transmission


In [192]:
# city and highway mileage will not contribute much to analysis
# dropping both city and highway mileage columns
df_specs_A_Engine.drop(columns = ['City Mileage (CarWale Tested)','Highway Mileage (CarWale Tested)'],inplace = True)

In [193]:
df_specs_A_Engine['Mileage (ARAI)'].unique()

array(['16.95', '16.77', '25.8', '17.8', '22.5', '16', '21.9', '15.8',
       '24.95', '20.5', '11.4', '14', '23.4', '13.6', '20.45', '24',
       '24.4', nan, '20', '9.52', '21.19', '19.01', '11.2', '8', '9.4',
       '8.45', '25.31', '17.92', '23', '20.4', '20.3', '21.45', '15',
       '13.1', '18.49', '17.11', '26.59', '14.2', '23.2', '21.01', '12.8',
       '13', '18', '16.7', '17.3', '16.2', '13.32', '19', '21.56', '11.1',
       '21.02', '16.65', '21.4', '19.34', '23.1', '15.96', '18.7', '22.9',
       '22.15', '16.66', '6.38', '18.48', '26.6', '17.4', '17.5', '23.01',
       '14.54', '20.63', '26.209999084472656', '13.96', '17.2', '16.1',
       '16.8', '22.95', '21.93', '21', '15.4', '18.6', '26', '8.2',
       '14.9', '11.45', '19.4', '25.2', '17.9', '24.29', '18.9', '18.15',
       '23.84', '19.1', '15.29', '23.59000015258789', '16.61', '22.54',
       '16.9', '19.16', '19.8', '19.05', '11.5', '13.7', '17.72', '19.56',
       '15.1', '10.78', '12.4', '12', '25.17', '12.07', '

In [194]:
# converting mileage column to float

df_specs_A_Engine['Mileage (ARAI)'] = df_specs_A_Engine['Mileage (ARAI)'].astype('float',copy=False)
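`astype('float')` raises an error if any entry is not parseable as a number. A more tolerant conversion can be sketched with `pd.to_numeric`. The demo values below are partly made up. The `'Not Available'` entry is included only to show the coercion and does not occur in this column, according to the checks above. The `.round(2)` step is optional and tidies long float representations such as `'26.209999084472656'`.

```python
import numpy as np
import pandas as pd

# toy stand-in for the 'Mileage (ARAI)' column
mileage = pd.Series(['16.95', '26.209999084472656', np.nan, 'Not Available'])

# errors='coerce' turns unparseable entries into NaN instead of raising
mileage_num = pd.to_numeric(mileage, errors='coerce').round(2)
print(mileage_num.tolist())
```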

In [195]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   profileId            58635 non-null  object 
 1   vehicle              58635 non-null  object 
 2   Engine               58300 non-null  object 
 3   Engine Type          53517 non-null  object 
 4   fuel                 58635 non-null  object 
 5   Max Power (bhp@rpm)  58071 non-null  object 
 6   Max Torque (Nm@rpm)  58071 non-null  object 
 7   Mileage (ARAI)       54078 non-null  float64
 8   Transmission         58635 non-null  object 
dtypes: float64(1), object(8)
memory usage: 4.0+ MB


##### `Transmission`

In [196]:

df_specs_A_Engine['Transmission'] = df_specs_A_Engine['Transmission'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['Transmission'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['Transmission'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['Transmission']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['Transmission']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['Transmission']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['Transmission']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['Transmission']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['Transmission']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [197]:
df_specs_A_Engine['Transmission'].unique()

array(['Automatic - 4 Gears',
       'Automatic - 8 Gears, Paddle Shift, Sport Mode',
       'Manual - 5 Gears', 'Manual - 6 Gears', 'AMT - 5 Gears',
       'Automatic - 9 Gears, Paddle Shift',
       'Automatic - 8 Gears, Manual Override, Sport Mode',
       'Automatic - 6 Gears', 'Automatic - 5 Gears', 'Manual',
       'Automatic - CVT Gears, Sport Mode',
       'Automatic - 7 Gears, Paddle Shift',
       'Automatic - 7 Gears, Paddle Shift, Sport Mode',
       'Automatic - 6 Gears, Manual Override',
       'Automatic - 6 Gears, Manual Override, Sport Mode',
       'Automatic - 9 Gears, Paddle Shift, Sport Mode',
       'Automatic - 9 Gears, Sport Mode',
       'Automatic - CVT Gears, Paddle Shift, Sport Mode',
       'Automatic - 7 Gears, Manual Override & Paddle Shift',
       'Automatic - CVT Gears', 'Automatic - 7 Gears, Manual Override',
       'Automatic - 8 Gears', 'Automatic - 6 Gears, Sport Mode',
       'Automatic - 7 Gears, Manual Override, Sport Mode',
       'Automatic - 

* the Transmission column has no null values
* grouping the values into 2 categories
    * Automatic
    * Manual
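The two-way grouping can be sketched as follows: take the first word of each description and treat everything other than `Manual` as `Automatic`. That catch-all rule is an assumption made for this illustration. The notebook instead replaces an explicit list of labels, which is stricter. The sample values are taken from the unique values above.

```python
import pandas as pd

# toy stand-in for the 'Transmission' column
transmission_raw = pd.Series(['Automatic - 4 Gears', 'Manual - 5 Gears',
                              'AMT - 5 Gears', 'Manual',
                              'Automatic - CVT Gears, Sport Mode'])

# first whitespace-separated word of each description
first_word = transmission_raw.str.split().str[0]

# keep 'Manual' as-is; everything else is labelled 'Automatic' (assumption)
transmission = first_word.where(first_word == 'Manual', 'Automatic')
print(transmission.tolist())
```

With a catch-all rule like this, any unexpected label would be counted as `Automatic` without notice, so checking `unique()` afterwards is still worthwhile.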

In [198]:
# keeping only the text before the first space (e.g. 'Automatic', 'Manual', 'AMT')
df_specs_A_Engine['transmission'] = df_specs_A_Engine['Transmission'].str.partition(' ')[0]

In [199]:
df_specs_A_Engine['transmission'].replace(['AMT','Automatic,','Clutchless'],'Automatic',inplace=True)

In [200]:
df_specs_A_Engine['transmission'].unique()

array(['Automatic', 'Manual'], dtype=object)

In [201]:
# deleting the original Transmission column
df_specs_A_Engine.drop(columns = ['Transmission'],inplace = True)

In [202]:
df_specs_A_Engine.head(2)

Unnamed: 0,profileId,vehicle,Engine,Engine Type,fuel,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),transmission
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,"1197 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",4 cylinder inline petrol engine,Petrol,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic
1,D1982769,Audi A8 L 50 TDI,"2967 cc, 6 Cylinders In V Shape, 4 Valves/Cyli...",V6 diesel engine with common rail injection sy...,Diesel,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,Automatic


In [203]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   profileId            58635 non-null  object 
 1   vehicle              58635 non-null  object 
 2   Engine               58300 non-null  object 
 3   Engine Type          53517 non-null  object 
 4   fuel                 58635 non-null  object 
 5   Max Power (bhp@rpm)  58071 non-null  object 
 6   Max Torque (Nm@rpm)  58071 non-null  object 
 7   Mileage (ARAI)       54078 non-null  float64
 8   transmission         58635 non-null  object 
dtypes: float64(1), object(8)
memory usage: 4.0+ MB


##### `Engine AND Engine Type`

In [204]:

df_specs_A_Engine['Engine'] = df_specs_A_Engine['Engine'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['Engine'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['Engine'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['Engine']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['Engine']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['Engine']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['Engine']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['Engine']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['Engine']== '0'].shape[0],'\n'
     )


null -  335 
 nan -  335 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* There are 335 null values
* Checking whether any other field can be used to fill the null values

In [205]:
df_specs_A_Engine.head()

Unnamed: 0,profileId,vehicle,Engine,Engine Type,fuel,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),transmission
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,"1197 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",4 cylinder inline petrol engine,Petrol,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic
1,D1982769,Audi A8 L 50 TDI,"2967 cc, 6 Cylinders In V Shape, 4 Valves/Cyli...",V6 diesel engine with common rail injection sy...,Diesel,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,Automatic
2,D2136401,Honda Amaze 1.5 EX i-DTEC,"1498 cc, 4 Cylinders Inline, 4 Valves/Cylinder...","4-Cylinder, DOHC i-DTEC",Diesel,99 bhp @ 3600 rpm,200 Nm @ 1750 rpm,25.8,Manual
3,D2184679,Honda City S,"1497 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",,Petrol,117 bhp @ 6600 rpm,145 Nm @ 4600 rpm,17.8,Manual
4,D2184744,Hyundai Elite i20 Asta 1.4 (O) CRDi,"1396 cc, 4 Cylinders Inline, 4 Valves/Cylinder...","16 Valves, 4 Cylinder",Diesel,89 bhp @ 4000 rpm,220 Nm @ 1500 rpm,22.5,Manual


* Checking if Engine Type field is of any use

In [206]:
df_specs_A_Engine['Engine Type'].unique()

array(['4 cylinder inline petrol engine',
       'V6 diesel engine with common rail injection system and turbocharging',
       '4-Cylinder, DOHC i-DTEC', nan, '16 Valves, 4 Cylinder',
       '4 cylinder mHawk CRDe diesel engine',
       '4 cyl , 16 Valves,DOHC with VGT', '1.5L Ti-VCT (Petrol)',
       '2nd Gen 1.2 U2 CRDi Diesel', 'K10B',
       '2.5 Liter, 4-cylinder, 16 valve, DOHC, Turbo',
       'Multijet/DDis/DOHC', 'DDiS Diesel Engine', 'F10D Petrol',
       'mEagle - developed on NEF CRDe platform with Dual Pilot Injection and top mounted Intercooler',
       '1.5 dCI K9K HP Diesel engine', '2nd Gen 1.1 U2 CRDi Diesel',
       '8 V DOHC', 'V6 Petrol Engine with electric hybrid',
       '1.4 U II 6 Speed Manual Transmission',
       '1.5 dCI K9K THP Diesel engine',
       '2KD-FTV, Diesel With Turbocharger, DOHC', '4B 12 2.4 DOCH',
       'Water cooled, 4-stroke, common rail direct injection diesel with variable nozzle turbine turbocharger',
       'BMW TwinTurbo inline 6 Petrol

* no cc data was found in Engine Type, so it cannot be used to fill the missing Engine values
* deleting this column

In [207]:
# taking a copy of the processed data for contingency
df_specs_A_Engine_3 = df_specs_A_Engine.copy()

In [208]:
# deleting engine type column

df_specs_A_Engine.drop(columns = ['Engine Type'],inplace = True)

In [209]:
# splitting the engine column values

df_specs_A_Engine['Engine'].str.split(pat=',',expand=True)

Unnamed: 0,0,1,2,3
0,1197 cc,4 Cylinders Inline,4 Valves/Cylinder,DOHC
1,2967 cc,6 Cylinders In V Shape,4 Valves/Cylinder,DOHC
2,1498 cc,4 Cylinders Inline,4 Valves/Cylinder,DOHC
3,1497 cc,4 Cylinders Inline,4 Valves/Cylinder,SOHC
4,1396 cc,4 Cylinders Inline,4 Valves/Cylinder,DOHC
...,...,...,...,...
58630,1197 cc,4 Cylinders Inline,4 Valves/Cylinder,DOHC
58631,1493 cc,3 Cylinders Inline,4 Valves/Cylinder,DOHC
58632,814 cc,3 Cylinders Inline,3 Valves/Cylinder,SOHC
58633,1497 cc,4 Cylinders Inline,4 Valves/Cylinder,SOHC


In [210]:
# splitting the engine specification into 4 new columns in the main dataframe

df_specs_A_Engine[['Displacement','Cylinders','Valves','Camshaft']] = df_specs_A_Engine['Engine'].str.split(pat=',',expand=True)

In [211]:
df_specs_A_Engine.head(2)

Unnamed: 0,profileId,vehicle,Engine,fuel,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),transmission,Displacement,Cylinders,Valves,Camshaft
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,"1197 cc, 4 Cylinders Inline, 4 Valves/Cylinder...",Petrol,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic,1197 cc,4 Cylinders Inline,4 Valves/Cylinder,DOHC
1,D1982769,Audi A8 L 50 TDI,"2967 cc, 6 Cylinders In V Shape, 4 Valves/Cyli...",Diesel,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,Automatic,2967 cc,6 Cylinders In V Shape,4 Valves/Cylinder,DOHC


In [212]:
# checking unique values in the displacement column for converting them to integers or floats
df_specs_A_Engine['Displacement'].unique()

array(['1197 cc', '2967 cc', '1498 cc', '1497 cc', '1396 cc', '2179 cc',
       '1582 cc', '1499 cc', '1186 cc', '998 cc', '2494 cc', '1248 cc',
       '1061 cc', '2498 cc', '1461 cc', '1120 cc', '4663 cc', '1399 cc',
       '2995 cc', '2360 cc', '2685 cc', '2979 cc', '1198 cc', '1086 cc',
       '2400 cc', '1364 cc', '1493 cc', '1968 cc', '2143 cc', '995 cc',
       '1984 cc', '1196 cc', '2953 cc', '1199 cc', '1462 cc', '2993 cc',
       '2523 cc', '1598 cc', '5998 cc', '1995 cc', '1591 cc', '1373 cc',
       '999 cc', '796 cc', '1451 cc', '1336 cc', '2987 cc', '2776 cc',
       nan, '5461 cc', '1991 cc', '1495 cc', '799 cc', '1896 cc',
       '1496 cc', '1150 cc', '2184 cc', '2489 cc', '1796 cc', '814 cc',
       '1799 cc', '1998 cc', '1956 cc', '1997 cc', '936 cc', '1999 cc',
       '1798 cc', '2755 cc', 'Not Applicable Cylinders Not Applicable',
       '2997 cc', '2393 cc', '3498 cc', '1368 cc', '1969 cc', '1395 cc',
       '1047 cc', '2696 cc', '2477 cc', '2982 cc', '3198 cc', '19

In [213]:
df_specs_A_Engine[df_specs_A_Engine['Displacement'] =='Not Applicable Cylinders Not Applicable'].head(2)

Unnamed: 0,profileId,vehicle,Engine,fuel,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),transmission,Displacement,Cylinders,Valves,Camshaft
319,D3267965,MG ZS EV Exclusive [2020-2021],"Not Applicable Cylinders Not Applicable, Not A...",Electric,,,,Automatic,Not Applicable Cylinders Not Applicable,Not Applicable Valves/Cylinder,Not Applicable,
582,D3531205,Tata Nexon EV XZ Plus,"Not Applicable Cylinders Not Applicable, Not A...",Electric,,,,Automatic,Not Applicable Cylinders Not Applicable,Not Applicable Valves/Cylinder,Not Applicable,


In [214]:
# replacing the inconsistent string contents

df_specs_A_Engine['Displacement'].replace('Not Applicable Cylinders Not Applicable',0,inplace=True)

In [215]:
df_specs_A_Engine['Displacement'].unique()

array(['1197 cc', '2967 cc', '1498 cc', '1497 cc', '1396 cc', '2179 cc',
       '1582 cc', '1499 cc', '1186 cc', '998 cc', '2494 cc', '1248 cc',
       '1061 cc', '2498 cc', '1461 cc', '1120 cc', '4663 cc', '1399 cc',
       '2995 cc', '2360 cc', '2685 cc', '2979 cc', '1198 cc', '1086 cc',
       '2400 cc', '1364 cc', '1493 cc', '1968 cc', '2143 cc', '995 cc',
       '1984 cc', '1196 cc', '2953 cc', '1199 cc', '1462 cc', '2993 cc',
       '2523 cc', '1598 cc', '5998 cc', '1995 cc', '1591 cc', '1373 cc',
       '999 cc', '796 cc', '1451 cc', '1336 cc', '2987 cc', '2776 cc',
       nan, '5461 cc', '1991 cc', '1495 cc', '799 cc', '1896 cc',
       '1496 cc', '1150 cc', '2184 cc', '2489 cc', '1796 cc', '814 cc',
       '1799 cc', '1998 cc', '1956 cc', '1997 cc', '936 cc', '1999 cc',
       '1798 cc', '2755 cc', 0, '2997 cc', '2393 cc', '3498 cc',
       '1368 cc', '1969 cc', '1395 cc', '1047 cc', '2696 cc', '2477 cc',
       '2982 cc', '3198 cc', '1996 cc', '2835 cc', '1468 cc', '2199 cc',

In [216]:
# replacing the inconsistent string contents

df_specs_A_Engine['Displacement'].replace('4 Cylinders Inline',0,inplace=True)

In [217]:
df_specs_A_Engine[df_specs_A_Engine['Displacement'] =='4 Cylinders Inline'].shape

(0, 12)

In [218]:
# null or empty values

df_specs_A_Engine[df_specs_A_Engine['Displacement'].isna()].shape

(335, 12)

In [219]:
# stripping the ' cc' suffix from the displacement values so they can be converted to float

df_specs_A_Engine['Displacement'] = df_specs_A_Engine['Displacement'].replace(' cc', '', regex=True)

In [220]:
df_specs_A_Engine['Displacement'].unique()

array(['1197', '2967', '1498', '1497', '1396', '2179', '1582', '1499',
       '1186', '998', '2494', '1248', '1061', '2498', '1461', '1120',
       '4663', '1399', '2995', '2360', '2685', '2979', '1198', '1086',
       '2400', '1364', '1493', '1968', '2143', '995', '1984', '1196',
       '2953', '1199', '1462', '2993', '2523', '1598', '5998', '1995',
       '1591', '1373', '999', '796', '1451', '1336', '2987', '2776', nan,
       '5461', '1991', '1495', '799', '1896', '1496', '1150', '2184',
       '2489', '1796', '814', '1799', '1998', '1956', '1997', '936',
       '1999', '1798', '2755', 0, '2997', '2393', '3498', '1368', '1969',
       '1395', '1047', '2696', '2477', '2982', '3198', '1996', '2835',
       '1468', '2199', '624', '1950', '1405', '4461', '1586', '2198',
       '4134', '2354', '2925', '1597', '1298', '2996', '3956', '1353',
       '1797', '2998', '3200', '3628', '2157', '2497', '3192', '1794',
       '2956', '1330', '2694', '1986', '1988', '2698', '5204', '1490',
      

In [221]:
# replacing the temporary 0 values with np.nan
df_specs_A_Engine['Displacement'] = df_specs_A_Engine['Displacement'].replace(0, np.nan)

In [222]:
df_specs_A_Engine['Displacement'].unique()

array(['1197', '2967', '1498', '1497', '1396', '2179', '1582', '1499',
       '1186', '998', '2494', '1248', '1061', '2498', '1461', '1120',
       '4663', '1399', '2995', '2360', '2685', '2979', '1198', '1086',
       '2400', '1364', '1493', '1968', '2143', '995', '1984', '1196',
       '2953', '1199', '1462', '2993', '2523', '1598', '5998', '1995',
       '1591', '1373', '999', '796', '1451', '1336', '2987', '2776', nan,
       '5461', '1991', '1495', '799', '1896', '1496', '1150', '2184',
       '2489', '1796', '814', '1799', '1998', '1956', '1997', '936',
       '1999', '1798', '2755', '2997', '2393', '3498', '1368', '1969',
       '1395', '1047', '2696', '2477', '2982', '3198', '1996', '2835',
       '1468', '2199', '624', '1950', '1405', '4461', '1586', '2198',
       '4134', '2354', '2925', '1597', '1298', '2996', '3956', '1353',
       '1797', '2998', '3200', '3628', '2157', '2497', '3192', '1794',
       '2956', '1330', '2694', '1986', '1988', '2698', '5204', '1490',
       ...

In [223]:
# converting the displacement to float
df_specs_A_Engine['Displacement'] = df_specs_A_Engine['Displacement'].astype('float',copy=False)

In [224]:
df_specs_A_Engine['Displacement']

0        1197.0
1        2967.0
2        1498.0
3        1497.0
4        1396.0
          ...  
58630    1197.0
58631    1493.0
58632     814.0
58633    1497.0
58634     814.0
Name: Displacement, Length: 58635, dtype: float64
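
The displacement clean-up above takes four steps (two placeholder replacements, the `' cc'` strip, and the 0 to NaN swap). A minimal sketch of a one-pass alternative is shown below, assuming every valid entry looks like `<number> cc`. The toy Series `displacement_raw` and the name `displacement_cc` are illustrative only and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_specs_A_Engine['Displacement'] (values are illustrative)
displacement_raw = pd.Series(
    ["1197 cc", "Not Applicable Cylinders Not Applicable", "4 Cylinders Inline", np.nan, "2967 cc"]
)

# Capture the number in front of "cc"; entries that do not match the pattern come back as NaN
extracted = displacement_raw.str.extract(r"^\s*(\d+(?:\.\d+)?)\s*cc\s*$", expand=False)

# Convert the captured text to float; NaN stays NaN
displacement_cc = pd.to_numeric(extracted, errors="coerce").astype("float64")

print(displacement_cc.tolist())
```

With this approach any unexpected text ends up as NaN without being listed explicitly, so it is worth comparing the null count before and after, as the notebook does.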

- The `Cylinders`, `Valves` and `Camshaft` values are not needed for the analysis.
- These columns are dropped along with the original `Engine` column, after a quick look at their unique values.

In [225]:
df_specs_A_Engine['Cylinders'].unique()

array([' 4 Cylinders Inline', ' 6 Cylinders In V Shape',
       ' 3 Cylinders Inline', ' 4 Cylinders 4 Valves/Cylinder',
       ' 8 Cylinders In V Shape', ' 6 Cylinders Flat',
       ' 5 Cylinders Inline', ' 6 Cylinders Inline',
       ' 3 Cylinders 4 Valves/Cylinder', ' 12 Cylinders In W Shape', None,
       ' 6 Cylinders 4 Valves/Cylinder', nan, ' 4 Cylinders',
       ' 2 Valves/Cylinder', ' Not Applicable Valves/Cylinder',
       ' 4 Cylinders In V Shape', ' 2 Cylinders',
       ' 8 Cylinders 4 Valves/Cylinder', ' 3 Cylinders 3 Valves/Cylinder',
       ' 4 Cylinders 2 Valves/Cylinder', ' 4 Cylinders Flat',
       ' 4 Valves/Cylinder', ' 10 Cylinders In V Shape',
       ' 12 Cylinders In V Shape', ' 2 Cylinders Inline', ' In V Shape',
       ' 2 Cylinders In V Shape', ' 4 Cylinders SOHC', ' Inline', ' DOHC',
       ' 4 Cylinders DOHC', ' 8 Cylinders Inline'], dtype=object)

In [226]:
df_specs_A_Engine['Valves'].unique()

array([' 4 Valves/Cylinder', None, ' 2 Valves/Cylinder',
       ' 3 Valves/Cylinder', nan, ' DOHC', ' Not Applicable', ' SOHC',
       ' 6 Valves/Cylinder', ' 5 Valves/Cylinder'], dtype=object)

In [227]:
df_specs_A_Engine['Camshaft'].unique()

array([' DOHC', ' SOHC', None, nan], dtype=object)

In [228]:
df_specs_A_Engine.drop(columns = ['Cylinders','Valves','Camshaft','Engine'],inplace = True)

In [229]:
df_specs_A_Engine.head(2)

Unnamed: 0,profileId,vehicle,fuel,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),transmission,Displacement
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,Petrol,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic,1197.0
1,D1982769,Audi A8 L 50 TDI,Diesel,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,Automatic,2967.0


##### `Max Power (bhp@rpm) AND Max Torque (Nm@rpm)`

In [230]:

df_specs_A_Engine['Max Power (bhp@rpm)'] = df_specs_A_Engine['Max Power (bhp@rpm)'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['Max Power (bhp@rpm)']== '0'].shape[0],'\n'
     )


null -  564 
 nan -  564 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [231]:
df_specs_A_Engine['Max Torque (Nm@rpm)'] = df_specs_A_Engine['Max Torque (Nm@rpm)'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)']==''].shape[0],'\n',
       "space - ",df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Engine[df_specs_A_Engine['Max Torque (Nm@rpm)']== '0'].shape[0],'\n'
     )


null -  564 
 nan -  564 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 
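
The same eight-line check is repeated for every column. Two of its entries cannot add information: `isnull()` is an alias of `isna()`, so "null" and "nan" always agree, and an element-wise `== None` comparison is `False` for every row, so "None" is always 0. A minimal sketch of a reusable helper follows. The function name `audit_placeholders` and its labels are hypothetical, not something defined in the notebook.

```python
import numpy as np
import pandas as pd

def audit_placeholders(col: pd.Series) -> pd.Series:
    """Count missing values and common placeholder strings in one column."""
    # Work on a stripped string view so ' ' and '' are treated alike
    text = col.astype("string").str.strip()
    return pd.Series(
        {
            "missing (NaN/None)": int(col.isna().sum()),
            "empty string": int((text == "").sum()),
            "Not Available": int((text == "Not Available").sum()),
            "zero": int(text.isin(["0", "0.0"]).sum()),
        }
    )

# Toy column with one of each case (values are illustrative)
sample = pd.Series(["79 bhp @ 6000 rpm", np.nan, None, "", " ", "Not Available", "0"])
report = audit_placeholders(sample)
print(report)
```

On the real data the call would look like `audit_placeholders(df_specs_A_Engine['Max Torque (Nm@rpm)'])`.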



* Both `Max Power (bhp@rpm)` and `Max Torque (Nm@rpm)` contain 564 null values.
* Each column is split at `@` into a value part (bhp or Nm) and an rpm part.

In [232]:
# splitting power data into separate columns
df_specs_A_Engine[['maxPower_(bhp)','maxPowerrpm']] = df_specs_A_Engine['Max Power (bhp@rpm)'].str.split(pat="@",expand=True)

In [233]:
# splitting torque data into separate columns
df_specs_A_Engine[['maxTorque_(Nm)','maxTorquerpm']] = df_specs_A_Engine['Max Torque (Nm@rpm)'].str.split(pat="@",expand=True)

In [234]:
df_specs_A_Engine.head(2).iloc[:,7:]

Unnamed: 0,Displacement,maxPower_(bhp),maxPowerrpm,maxTorque_(Nm),maxTorquerpm
0,1197.0,79 bhp,6000 rpm,112.7619 Nm,4000 rpm
1,2967.0,247 bhp,4000 rpm,580 Nm,1750 rpm


In [235]:
# removing the ' bhp' unit suffix
df_specs_A_Engine['maxPower_(bhp)'] = df_specs_A_Engine['maxPower_(bhp)'].replace(' bhp', '', regex=True)

In [236]:
df_specs_A_Engine['maxPower_(bhp)'] = df_specs_A_Engine['maxPower_(bhp)'].str.strip()

In [237]:
# exact-match replacement of empty strings with 0 (no regex needed)
df_specs_A_Engine['maxPower_(bhp)'] = df_specs_A_Engine['maxPower_(bhp)'].replace('', 0)

In [238]:
df_specs_A_Engine['maxPower_(bhp)'] = df_specs_A_Engine['maxPower_(bhp)'].astype('float',copy=False)

In [239]:
# checking that the null count remains the same (564)
df_specs_A_Engine[df_specs_A_Engine['maxPower_(bhp)'].isnull()].shape

(564, 12)

In [648]:
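# keeping the first four characters of the rpm text (e.g. '6000 rpm' -> '6000') and converting to float
df_specs_A_Engine['maxPowerrpm'] = df_specs_A_Engine['maxPowerrpm'].str.strip().str.slice(0,4).replace('',0).astype('float',copy=True)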

* Torque

In [240]:
# removing the ' Nm' unit suffix
df_specs_A_Engine['maxTorque_(Nm)'] = df_specs_A_Engine['maxTorque_(Nm)'].replace(' Nm', '', regex=True)

In [241]:
df_specs_A_Engine['maxTorque_(Nm)'] = df_specs_A_Engine['maxTorque_(Nm)'].str.strip()

In [242]:
# exact-match replacement of empty strings with 0 (no regex needed)
df_specs_A_Engine['maxTorque_(Nm)'] = df_specs_A_Engine['maxTorque_(Nm)'].replace('', 0)

In [243]:
df_specs_A_Engine['maxTorque_(Nm)'] = df_specs_A_Engine['maxTorque_(Nm)'].astype('float',copy=False)

In [651]:
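# keeping the first four characters of the rpm text (e.g. '4000 rpm' -> '4000') and converting to float
df_specs_A_Engine['maxTorquerpm'] = df_specs_A_Engine['maxTorquerpm'].str.strip().str.slice(0,4).replace('',0).astype('float',copy=True)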

In [653]:
df_specs_A_Engine.head(2).iloc[:,6:]

Unnamed: 0,maxPower_(bhp),maxPowerrpm,maxTorque_(Nm),maxTorquerpm
0,79.0,6000.0,112.7619,4000.0
1,247.0,4000.0,580.0,1750.0
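
The `@` split followed by suffix removal and `str.slice(0,4)` works when every rpm figure has exactly four leading characters to keep. A minimal sketch of a regex-based alternative is shown below. It captures the value and the rpm in one step and does not depend on the number of digits. The toy Series `power_raw`, the pattern, and the third row (a value with no rpm part) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like 'Max Power (bhp@rpm)'; the row without '@' is a hypothetical edge case
power_raw = pd.Series(["79 bhp @ 6000 rpm", "247 bhp @ 4000 rpm", "108 bhp", np.nan])

# value: number before 'bhp'; rpm: optional integer after '@'
pattern = r"^\s*(?P<value>\d+(?:\.\d+)?)\s*bhp(?:\s*@\s*(?P<rpm>\d+))?"

# One column per named group; unmatched groups and null rows become NaN
power = power_raw.str.extract(pattern).astype("float64")
power.columns = ["maxPower_(bhp)", "maxPowerrpm"]

print(power)
```

For the torque column the literal `bhp` in the pattern would be swapped for `Nm`. Unlike the notebook's approach, missing parts stay NaN here and are never turned into 0.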


In [654]:
# checking that the null count remains the same (564)
df_specs_A_Engine[df_specs_A_Engine['maxTorque_(Nm)'].isnull()].shape

(564, 10)

In [655]:
df_specs_A_Engine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58635 entries, 0 to 58634
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   profileId       58635 non-null  object 
 1   vehicle         58635 non-null  object 
 2   fuel            58635 non-null  object 
 3   Mileage (ARAI)  54078 non-null  float64
 4   transmission    58635 non-null  object 
 5   Displacement    58058 non-null  float64
 6   maxPower_(bhp)  58071 non-null  float64
 7   maxPowerrpm     58067 non-null  float64
 8   maxTorque_(Nm)  58071 non-null  float64
 9   maxTorquerpm    58059 non-null  float64
dtypes: float64(6), object(4)
memory usage: 4.5+ MB


##### FINAL ENGINE DATASET

In [247]:
df_specs_A_Engine_1 = df_specs_A_Engine.copy()

In [248]:
df_specs_A_Engine.head(2)

Unnamed: 0,profileId,vehicle,fuel,Max Power (bhp@rpm),Max Torque (Nm@rpm),Mileage (ARAI),transmission,Displacement,maxPower_(bhp),maxPowerrpm,maxTorque_(Nm),maxTorquerpm
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,Petrol,79 bhp @ 6000 rpm,112.7619 Nm @ 4000 rpm,16.95,Automatic,1197.0,79.0,6000 rpm,112.7619,4000 rpm
1,D1982769,Audi A8 L 50 TDI,Diesel,247 bhp @ 4000 rpm,580 Nm @ 1750 rpm,16.77,Automatic,2967.0,247.0,4000 rpm,580.0,1750 rpm


In [249]:
df_specs_A_Engine.drop(columns = ['Max Power (bhp@rpm)','Max Torque (Nm@rpm)'],inplace=True)

In [250]:
df_specs_A_Engine.head()

Unnamed: 0,profileId,vehicle,fuel,Mileage (ARAI),transmission,Displacement,maxPower_(bhp),maxPowerrpm,maxTorque_(Nm),maxTorquerpm
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,Petrol,16.95,Automatic,1197.0,79.0,6000 rpm,112.7619,4000 rpm
1,D1982769,Audi A8 L 50 TDI,Diesel,16.77,Automatic,2967.0,247.0,4000 rpm,580.0,1750 rpm
2,D2136401,Honda Amaze 1.5 EX i-DTEC,Diesel,25.8,Manual,1498.0,99.0,3600 rpm,200.0,1750 rpm
3,D2184679,Honda City S,Petrol,17.8,Manual,1497.0,117.0,6600 rpm,145.0,4600 rpm
4,D2184744,Hyundai Elite i20 Asta 1.4 (O) CRDi,Diesel,22.5,Manual,1396.0,89.0,4000 rpm,220.0,1500 rpm


In [251]:
df_specs_A_Engine.to_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\specs_Engine_A.csv',index=False)

#### `Dimensions & Weight`

In [252]:
# looking at the data
df_specs_A[df_specs_A['spec_category'] == 'Dimensions & Weight'].head()

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
15,Length,4924,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
16,Width,2157,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
17,Height,1772,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
18,Wheelbase,2995,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
19,Ground Clearance,215,mm,Dimensions & Weight,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


In [253]:
df_specs_A_Dimensions = df_specs_A[df_specs_A['spec_category'] == 'Dimensions & Weight'].copy()

##### PIVOTING TABLE

In [254]:
df_specs_A_Dimensions_1 = df_specs_A_Dimensions.copy()

In [255]:
df_specs_A_Dimensions = df_specs_A_Dimensions.pivot(index =['profileId','vehicle'], columns = 'specName', values ='specValue')

In [256]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 58423 entries, ('D1820959', 'Hyundai i10 Sportz 1.2 AT Kappa2') to ('S2819375', 'Hyundai Eon Era +')
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Ground Clearance  44174 non-null  object
 1   Height            58423 non-null  object
 2   Kerb Weight       36335 non-null  object
 3   Length            58423 non-null  object
 4   Wheelbase         58421 non-null  object
 5   Width             58423 non-null  object
dtypes: object(6)
memory usage: 5.6+ MB


In [257]:
df_specs_A_Dimensions.head(2)

Unnamed: 0_level_0,specName,Ground Clearance,Height,Kerb Weight,Length,Wheelbase,Width
profileId,vehicle,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,165.0,1550,,3585,2380,1595
D1982769,Audi A8 L 50 TDI,,1471,2010.0,5265,3122,1949


In [258]:
# resetting index
df_specs_A_Dimensions.reset_index(inplace = True)

In [259]:
df_specs_A_Dimensions.index

RangeIndex(start=0, stop=58423, step=1)

In [260]:
# clearing the leftover column-axis name ('specName') after resetting the index
df_specs_A_Dimensions.rename_axis('',axis = 1,inplace=True)
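
The pivot, index reset and axis rename above can be tried on a small frame to see what each step does. The sketch below uses made-up rows with the same column names as `df_specs_A`. The names `long_df` and `wide_df` are illustrative only.

```python
import pandas as pd

# Toy long-format rows: one row per (car, spec) pair
long_df = pd.DataFrame(
    {
        "profileId": ["P1", "P1", "P2", "P2"],
        "vehicle": ["Car A", "Car A", "Car B", "Car B"],
        "specName": ["Length", "Width", "Length", "Width"],
        "specValue": ["3585", "1595", "5265", "1949"],
    }
)

wide_df = (
    long_df
    # one row per car, one column per spec name
    .pivot(index=["profileId", "vehicle"], columns="specName", values="specValue")
    # move profileId and vehicle back into ordinary columns
    .reset_index()
    # drop the 'specName' label left on the column axis
    .rename_axis(None, axis=1)
)

print(wide_df)
```

`pivot` raises an error if the same (`profileId`, `vehicle`, `specName`) combination appears more than once, so it also acts as a check that each car has at most one value per spec.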

In [261]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   profileId         58423 non-null  object
 1   vehicle           58423 non-null  object
 2   Ground Clearance  44174 non-null  object
 3   Height            58423 non-null  object
 4   Kerb Weight       36335 non-null  object
 5   Length            58423 non-null  object
 6   Wheelbase         58421 non-null  object
 7   Width             58423 non-null  object
dtypes: object(8)
memory usage: 3.6+ MB


In [262]:
# number of unique values in each column
df_specs_A_Dimensions.nunique()


profileId           58423
vehicle              5113
Ground Clearance      102
Height                349
Kerb Weight           663
Length                470
Wheelbase             249
Width                 271
dtype: int64

In [263]:
# masking with notnull() keeps every row; null cells are simply shown as NaN
df_specs_A_Dimensions[df_specs_A_Dimensions.notnull()]

Unnamed: 0,profileId,vehicle,Ground Clearance,Height,Kerb Weight,Length,Wheelbase,Width
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,165,1550,,3585,2380,1595
1,D1982769,Audi A8 L 50 TDI,,1471,2010,5265,3122,1949
2,D2136401,Honda Amaze 1.5 EX i-DTEC,165,1505,1065,3990,2405,1680
3,D2184679,Honda City S,165,1495,1041,4440,2600,1695
4,D2184744,Hyundai Elite i20 Asta 1.4 (O) CRDi,170,1505,,3985,2570,1734
...,...,...,...,...,...,...,...,...
58418,S2819197,Hyundai Xcent S 1.2 [2014-2016],165,1520,,3995,2425,1660
58419,S2819213,Mahindra Bolero Neo Limited Edition [2023],180,1817,,3995,2680,1795
58420,S2819265,Hyundai Eon D-Lite +,170,1500,715,3495,2380,1550
58421,S2819349,Honda City VX CVT,165,1495,1085,4440,2600,1695


In [264]:
df_specs_A_Dimensions['Ground Clearance'].unique()

array(['165', nan, '170', '200', '205', '168', '215', '190', '172', '136',
       '175', '184', '163', '180', '133', '160', '212', '210', '152',
       '149', '154', '185', '192', '158', '183', '241', '198', '174',
       '204.8', '164', '144', '226', '188', '145', '157', '120', '155',
       '220', '209', '214', '161', '176', '179', '295.5', '139', '211',
       '195', '141', '208', '225', '217', '167', '239.8', '159', '150',
       '201', '216', '213', '135', '244', '110', '202', '142', '138',
       '230', '238', '100', '189', '171', '221', '182', '223', '219',
       '117', '187', '156', '151', '218', '137', '186', '196', '204',
       '112', '126', '118', '128', '116', '227', '181', '134', '206',
       '109', '140', '130', '197', '147', '113', '114', '162', '178',
       '177', '191', '235'], dtype=object)

* analysing the data column by column

##### `Ground Clearance`

In [265]:

df_specs_A_Dimensions['Ground Clearance'] = df_specs_A_Dimensions['Ground Clearance'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance']==''].shape[0],'\n',
       "space - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Ground Clearance']== '0'].shape[0],'\n'
     )


null -  14249 
 nan -  14249 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* A large share of rows has no ground clearance value: 14,249 of 58,423 (about 24%).
* The column is still kept for analysis.

In [266]:
df_specs_A_Dimensions['Ground Clearance'].unique()

array(['165', nan, '170', '200', '205', '168', '215', '190', '172', '136',
       '175', '184', '163', '180', '133', '160', '212', '210', '152',
       '149', '154', '185', '192', '158', '183', '241', '198', '174',
       '204.8', '164', '144', '226', '188', '145', '157', '120', '155',
       '220', '209', '214', '161', '176', '179', '295.5', '139', '211',
       '195', '141', '208', '225', '217', '167', '239.8', '159', '150',
       '201', '216', '213', '135', '244', '110', '202', '142', '138',
       '230', '238', '100', '189', '171', '221', '182', '223', '219',
       '117', '187', '156', '151', '218', '137', '186', '196', '204',
       '112', '126', '118', '128', '116', '227', '181', '134', '206',
       '109', '140', '130', '197', '147', '113', '114', '162', '178',
       '177', '191', '235'], dtype=object)

In [267]:
df_specs_A_Dimensions['Ground Clearance'] = df_specs_A_Dimensions['Ground Clearance'].astype('float64',copy=True)

In [268]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         58423 non-null  object 
 1   vehicle           58423 non-null  object 
 2   Ground Clearance  44174 non-null  float64
 3   Height            58423 non-null  object 
 4   Kerb Weight       36335 non-null  object 
 5   Length            58423 non-null  object 
 6   Wheelbase         58421 non-null  object 
 7   Width             58423 non-null  object 
dtypes: float64(1), object(7)
memory usage: 3.6+ MB


##### `Height`

In [269]:
df_specs_A_Dimensions['Height'] = df_specs_A_Dimensions['Height'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Height'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Height'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Height']==''].shape[0],'\n',
       "space - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Height']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Height']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Height']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Height']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Height']== '0'].shape[0],'\n'
     )


null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [270]:
df_specs_A_Dimensions['Height'].unique()

array(['1550', '1471', '1505', '1495', '1785', '1470', '1708', '1520',
       '1670', '1755', '1530', '1555', '1700', '1595', '1895', '1695',
       '1494', '1427', '1705', '1680', '1484', '1635', '1590', '1647',
       '1560', '1480', '1817', '1474', '1515', '1510', '1796', '1690',
       '1645', '1880', '1774', '1518', '1458', '1800', '1995', '1724',
       '1475', '1826', '1544', '1485', '1481', '1672', '1500', '1467',
       '1466', '1455', '1404', '1464', '1685', '1478', '1460', '1760',
       '1482', '1930', '1570', '1522', '1975', '1653', '1780', '1838',
       '1669', '1640', '1850', '1535', '1839', '1665', '1938', '1523',
       '1691', '1922', '1479', '1370', '1737', '1483', '1416', '1425',
       '1844', '1601', '1564', '1447', '1525', '1456', '1608', '1675',
       '1429', '1450', '1477', '1630', '1620', '1420', '1453', '1639',
       '1925', '1619', '1498', '1476', '1605', '1835', '1698', '1795',
       '1545', '1405', '1490', '1473', '1840', '1659', '1846', '1607',
      

In [271]:
df_specs_A_Dimensions['Height'] = df_specs_A_Dimensions['Height'].astype('float64',copy=True)

In [272]:
df_specs_A_Dimensions['Height'].head()

0    1550.0
1    1471.0
2    1505.0
3    1495.0
4    1505.0
Name: Height, dtype: float64

In [273]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         58423 non-null  object 
 1   vehicle           58423 non-null  object 
 2   Ground Clearance  44174 non-null  float64
 3   Height            58423 non-null  float64
 4   Kerb Weight       36335 non-null  object 
 5   Length            58423 non-null  object 
 6   Wheelbase         58421 non-null  object 
 7   Width             58423 non-null  object 
dtypes: float64(2), object(6)
memory usage: 3.6+ MB


##### `Kerb Weight`

In [274]:

df_specs_A_Dimensions['Kerb Weight'] = df_specs_A_Dimensions['Kerb Weight'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight']==''].shape[0],'\n',
       "space - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Kerb Weight']== '0'].shape[0],'\n'
     )


null -  22088 
 nan -  22088 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [275]:
df_specs_A_Dimensions['Kerb Weight'].unique()

array([nan, '2010', '1065', '1041', '1329', '1241', '1025', '895', '1060',
       '885', '1260', '2200', '1130', '2315', '1272', '1976', '1135',
       '820', '1304', '1551', '1108', '1270', '925', '1620', '1595',
       '1055', '1800', '898', '890', '2175', '1680', '1587', '928',
       '2014', '960', '1035', '955', '1024', '1029', '1950', '1720',
       '880', '930', '810', '1020', '1233', '1220', '1585', '2350', '845',
       '1145', '1095', '1175', '1608', '1170', '1510', '1152', '1830',
       '1125', '1656', '2115', '1099', '2968', '2075', '1190', '1920',
       '965', '1058', '950', '990', '970', '1015', '1122', '2550', '1022',
       '1625', '865', '1300', '2455', '1525', '1543', '935', '1845',
       '2000', '2005', '1785', '980', '2345', '1049', '1565', '1340',
       '1560', '1180', '1211', '1205', '1635', '1171', '1176', '763',
       '1040', '1550', '1660', '795', '1210', '915', '1150', '1540',
       '1184', '1584', '1050', '1063', '1178', '825', '1165', '1265',
       ...

In [276]:
df_specs_A_Dimensions['Kerb Weight'] = df_specs_A_Dimensions['Kerb Weight'].astype('float64',copy=True)

In [277]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         58423 non-null  object 
 1   vehicle           58423 non-null  object 
 2   Ground Clearance  44174 non-null  float64
 3   Height            58423 non-null  float64
 4   Kerb Weight       36335 non-null  float64
 5   Length            58423 non-null  object 
 6   Wheelbase         58421 non-null  object 
 7   Width             58423 non-null  object 
dtypes: float64(3), object(5)
memory usage: 3.6+ MB


##### `Length`

In [278]:


df_specs_A_Dimensions['Length'] = df_specs_A_Dimensions['Length'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Length'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Length'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Length']==''].shape[0],'\n',
       "space - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Length']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Length']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Length']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Length']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Length']== '0'].shape[0],'\n'
     )



null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [279]:
df_specs_A_Dimensions['Length'].unique()

array(['3585', '5265', '3990', '4440', '3985', '4585', '4530', '3999',
       '3765', '3636', '4555', '4160', '3995', '3599', '3495', '4520',
       '4315', '5246', '3795', '4846', '4580', '4640', '4818', '5212',
       '3675', '3565', '3998', '4635', '3610', '4540', '3539', '3595',
       '4701', '4879', '4804', '3520', '4370', '4355', '4456', '4600',
       '5060', '3850', '3955', '4490', '5219', '4486', '3825', '3695',
       '4107', '4390', '4384', '4915', '4806', '3600', '4386', '4899',
       '4375', '4296', '3679', '4655', '4838', '4000', '4569', '4430',
       '4933', '4629', '4781', '4417', '3940', '4878', '4877', '5120',
       '3746', '4270', '4265', '4763', '4425', '4825', '5130', '4424',
       '4223', '5226', '4250', '3775', '5089', '4861', '4936', '3982',
       '3880', '4850', '4596', '4767', '4385', '3370', '3655', '3886',
       '4624', '4545', '4310', '4597', '4395', '4633', '3989', '3700',
       '4300', '4969', '3640', '3970', '4656', '4650', '4480', '3690',
      

In [280]:
df_specs_A_Dimensions['Length'] = df_specs_A_Dimensions['Length'].astype('float64',copy=True)

In [281]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         58423 non-null  object 
 1   vehicle           58423 non-null  object 
 2   Ground Clearance  44174 non-null  float64
 3   Height            58423 non-null  float64
 4   Kerb Weight       36335 non-null  float64
 5   Length            58423 non-null  float64
 6   Wheelbase         58421 non-null  object 
 7   Width             58423 non-null  object 
dtypes: float64(4), object(4)
memory usage: 3.6+ MB


##### `Wheelbase`

In [282]:


df_specs_A_Dimensions['Wheelbase'] = df_specs_A_Dimensions['Wheelbase'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase']==''].shape[0],'\n',
       "space - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Wheelbase']== '0'].shape[0],'\n'
     )



null -  2 
 nan -  2 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [283]:
df_specs_A_Dimensions['Wheelbase'].unique()

array(['2380', '3122', '2405', '2600', '2570', '2700', '2520', '2425',
       '2400', '2750', '2390', '2430', '2360', '2760', '2673', '3165',
       '2489', '2895', '2670', '2854', '3210', '2385', '2519', '2776',
       '2680', '2808', '2874', '2450', '2915', '2610', '2345', '2646',
       '2350', '2660', '2741', '2860', '2530', '2650', '2677', '2553',
       '2552', '2912', '2746', '2968', '2740', '2422', '2761', '2550',
       '2470', '2525', '2465', '2578', '2807', '2699', '2923', '2845',
       '2933', '2500', '3075', '2590', '2850', '2775', '2501', '2440',
       '2460', '3002', '2841', '2637', '2975', '2580', '2567', '2480',
       '2555', '2491', '2786', '2603', '1840', '2435', '2960', '2810',
       '2685', '2636', '2914', '2375', '2456', '2873', '2502', '2688',
       '2995', '2745', '2782', '2585', '2175', '3035', '3120', '3215',
       '2922', '2498', '2920', '2702', '2467', '2679', '2820', '2512',
       '2994', '2864', '2510', '2835', '3000', '2800', '2469', '2725',
      

In [284]:
df_specs_A_Dimensions['Wheelbase'] = df_specs_A_Dimensions['Wheelbase'].astype('float64',copy=True)

In [285]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         58423 non-null  object 
 1   vehicle           58423 non-null  object 
 2   Ground Clearance  44174 non-null  float64
 3   Height            58423 non-null  float64
 4   Kerb Weight       36335 non-null  float64
 5   Length            58423 non-null  float64
 6   Wheelbase         58421 non-null  float64
 7   Width             58423 non-null  object 
dtypes: float64(5), object(3)
memory usage: 3.6+ MB


##### `Width`

In [286]:


df_specs_A_Dimensions['Width'] = df_specs_A_Dimensions['Width'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Width'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Width'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Width']==''].shape[0],'\n',
       "space - ",df_specs_A_Dimensions[df_specs_A_Dimensions['Width']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Width']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Dimensions[df_specs_A_Dimensions['Width']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Width']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Dimensions[df_specs_A_Dimensions['Width']== '0'].shape[0],'\n'
     )



null -  0 
 nan -  0 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [287]:
df_specs_A_Dimensions['Width'].unique()

array(['1595', '1949', '1680', '1695', '1734', '1890', '1775', '1765',
       '1660', '1475', '1770', '1690', '1495', '1850', '1822', '1899',
       '2155', '1760', '1800', '2134', '1715', '1525', '1865', '1645',
       '1795', '1826', '1854', '1735', '1745', '2141', '1490', '1866',
       '1783', '2120', '1820', '2069', '1700', '1788', '1694', '1730',
       '1902', '1729', '1839', '1665', '1600', '1699', '1874', '1943',
       '2094', '1579', '1835', '1817', '1706', '1710', '1642', '1769',
       '1898', '1911', '1804', '2073', '1983', '1790', '1934', '1647',
       '1780', '1855', '1825', '1755', '1793', '1965', '1670', '1864',
       '1796', '1868', '1727', '1520', '1832', '1831', '1410', '1620',
       '1550', '2031', '1750', '1682', '1818', '1811', '2139', '1918',
       '1813', '1721', '1814', '1995', '1859', '1809', '1830', '2044',
       '1440', '1871', '2220', '2183', '1777', '1635', '1828', '1863',
       '1913', '1842', '1731', '1968', '1881', '1903', '1938', '1687',
      

In [288]:
df_specs_A_Dimensions['Width'] = df_specs_A_Dimensions['Width'].astype('float64',copy=True)

In [289]:
df_specs_A_Dimensions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   profileId         58423 non-null  object 
 1   vehicle           58423 non-null  object 
 2   Ground Clearance  44174 non-null  float64
 3   Height            58423 non-null  float64
 4   Kerb Weight       36335 non-null  float64
 5   Length            58423 non-null  float64
 6   Wheelbase         58421 non-null  float64
 7   Width             58423 non-null  float64
dtypes: float64(6), object(2)
memory usage: 3.6+ MB


##### FINAL DIMENSIONS DATASET

In [290]:
df_specs_A_Dimensions_1 = df_specs_A_Dimensions.copy()

In [291]:
df_specs_A_Dimensions.columns

Index(['profileId', 'vehicle', 'Ground Clearance', 'Height', 'Kerb Weight',
       'Length', 'Wheelbase', 'Width'],
      dtype='object', name='')

In [292]:
rename_col = {}
for i in df_specs_A_Dimensions.columns:
    rename_col.update({i:i + '_(mm)'})

In [293]:
rename_col

{'profileId': 'profileId_(mm)',
 'vehicle': 'vehicle_(mm)',
 'Ground Clearance': 'Ground Clearance_(mm)',
 'Height': 'Height_(mm)',
 'Kerb Weight': 'Kerb Weight_(mm)',
 'Length': 'Length_(mm)',
 'Wheelbase': 'Wheelbase_(mm)',
 'Width': 'Width_(mm)'}
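
The loop above appends `_(mm)` to every column, including the identifier columns and `Kerb Weight`, which is a mass. One possible alternative, sketched here under the assumption that the long-format table carries a `specUnit` value for every `specName` (as the `Capacity` rows later in this notebook do), is to build the suffix from the data itself. The toy rows and the names `specs_long`, `units` and `rename_map` are illustrative only.

```python
import pandas as pd

# Toy long-format rows shaped like df_specs_A (values are illustrative only).
specs_long = pd.DataFrame({
    "specName": ["Length", "Width", "Kerb Weight", "Length"],
    "specUnit": ["mm", "mm", "kg", "mm"],
})

# One unit per spec name; the first occurrence is used.
units = specs_long.drop_duplicates("specName").set_index("specName")["specUnit"]

# Build 'Spec_Name_(unit)' labels; identifier columns are never touched
# because they are not spec names.
rename_map = {
    name: f"{name.replace(' ', '_')}_({unit})" for name, unit in units.items()
}
print(rename_map)
```

The resulting dictionary can be passed to `DataFrame.rename(columns=...)` in place of a hand-written mapping.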

In [294]:
df_specs_A_Dimensions.rename(columns = {'profileId': 'profileId',
                             'vehicle': 'vehicle',
                             'Ground Clearance': 'Ground_Clearance_(mm)',
                             'Height': 'Height_(mm)',
                             # Kerb Weight is a mass, not a length, so it takes a kg suffix
                             'Kerb Weight': 'Kerb_Weight_(kg)',
                             'Length': 'Length_(mm)',
                             'Wheelbase': 'Wheelbase_(mm)',
                             'Width': 'Width_(mm)'},inplace=True)

In [295]:
df_specs_A_Dimensions.head(2)

Unnamed: 0,profileId,vehicle,Ground_Clearance_(mm),Height_(mm),Kerb_Weight_(kg),Length_(mm),Wheelbase_(mm),Width_(mm)
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,165.0,1550.0,,3585.0,2380.0,1595.0
1,D1982769,Audi A8 L 50 TDI,,1471.0,2010.0,5265.0,3122.0,1949.0


In [296]:
df_specs_A_Dimensions.to_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\specs_Dimensions_A.csv',index=False)

#### `Capacity`

In [297]:
#looking at the data
df_specs_A[df_specs_A['spec_category'] == 'Capacity'].head()

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
20,Doors,5,Doors,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
21,Seating Capacity,5,Person,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
22,No of Seating Rows,2,Rows,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
23,Bootspace,630,litres,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
24,Fuel Tank Capacity,93,litres,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


In [298]:
df_specs_A_Capacity = df_specs_A[df_specs_A['spec_category'] == 'Capacity'].copy()

In [299]:
df_specs_A_Capacity.head(2)

Unnamed: 0,specName,specValue,specUnit,spec_category,profileId,vehicle
20,Doors,5,Doors,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB
21,Seating Capacity,5,Person,Capacity,S2725575,Mercedes-Benz GLE 300d 4MATIC LWB


##### PIVOTING TABLE

In [300]:
df_specs_A_Capacity_1 = df_specs_A_Capacity.copy()

In [301]:
df_specs_A_Capacity = df_specs_A_Capacity.pivot(index =['profileId','vehicle'], columns = 'specName', values ='specValue')

In [302]:
df_specs_A_Capacity.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 58423 entries, ('D1820959', 'Hyundai i10 Sportz 1.2 AT Kappa2') to ('S2819375', 'Hyundai Eon Era +')
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Bootspace           42381 non-null  object
 1   Doors               58415 non-null  object
 2   Fuel Tank Capacity  57408 non-null  object
 3   No of Seating Rows  52199 non-null  object
 4   Seating Capacity    58422 non-null  object
dtypes: object(5)
memory usage: 5.2+ MB


In [303]:
df_specs_A_Capacity.head(2)

Unnamed: 0_level_0,specName,Bootspace,Doors,Fuel Tank Capacity,No of Seating Rows,Seating Capacity
profileId,vehicle,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,5,35,2,5
D1982769,Audi A8 L 50 TDI,510.0,4,82,2,4


In [304]:
# resetting index
df_specs_A_Capacity.reset_index(inplace = True)

In [305]:
df_specs_A_Capacity.index

RangeIndex(start=0, stop=58423, step=1)

In [306]:
# renaming the axis name after resetting
df_specs_A_Capacity.rename_axis('',axis = 1,inplace=True)

In [307]:
df_specs_A_Capacity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   profileId           58423 non-null  object
 1   vehicle             58423 non-null  object
 2   Bootspace           42381 non-null  object
 3   Doors               58415 non-null  object
 4   Fuel Tank Capacity  57408 non-null  object
 5   No of Seating Rows  52199 non-null  object
 6   Seating Capacity    58422 non-null  object
dtypes: object(7)
memory usage: 3.1+ MB
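
The pivot, index reset and axis rename above can also be written as one chain. The sketch below uses a toy frame with the same column names as `df_specs_A_Capacity`; the values and the names `long_df` and `wide_df` are illustrative. Note that `pivot` raises a `ValueError` if the same (`profileId`, `vehicle`, `specName`) combination occurs twice, so it doubles as a check that the long table has no duplicate spec rows.

```python
import pandas as pd

# Toy long-format data: one row per (car, spec).
long_df = pd.DataFrame({
    "profileId": ["A1", "A1", "B2"],
    "vehicle": ["Car A", "Car A", "Car B"],
    "specName": ["Doors", "Bootspace", "Doors"],
    "specValue": ["5", "630", "4"],
})

wide_df = (
    long_df
    .pivot(index=["profileId", "vehicle"], columns="specName", values="specValue")
    .reset_index()              # profileId / vehicle become ordinary columns
    .rename_axis(columns=None)  # drop the leftover 'specName' axis label
)
print(wide_df)
```

Specs that a car does not list, such as `Bootspace` for the second toy car, come out as `NaN`, which is where the null counts in the per-column checks originate.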


In [308]:
# unique count of values of each column
df_specs_A_Capacity.nunique()


profileId             58423
vehicle                5112
Bootspace               208
Doors                     4
Fuel Tank Capacity       80
No of Seating Rows        4
Seating Capacity         10
dtype: int64

In [309]:
df_specs_A_Capacity[df_specs_A_Capacity.notnull()]

Unnamed: 0,profileId,vehicle,Bootspace,Doors,Fuel Tank Capacity,No of Seating Rows,Seating Capacity
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,5,35,2,5
1,D1982769,Audi A8 L 50 TDI,510,4,82,2,4
2,D2136401,Honda Amaze 1.5 EX i-DTEC,400,4,35,2,5
3,D2184679,Honda City S,510,4,40,2,5
4,D2184744,Hyundai Elite i20 Asta 1.4 (O) CRDi,285,5,40,2,5
...,...,...,...,...,...,...,...
58418,S2819197,Hyundai Xcent S 1.2 [2014-2016],407,4,43,2,5
58419,S2819213,Mahindra Bolero Neo Limited Edition [2023],384,5,50,3,7
58420,S2819265,Hyundai Eon D-Lite +,215,5,32,2,5
58421,S2819349,Honda City VX CVT,510,4,40,2,5


##### `Bootspace`

In [310]:

df_specs_A_Capacity['Bootspace'] = df_specs_A_Capacity['Bootspace'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Capacity[df_specs_A_Capacity['Bootspace'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Capacity[df_specs_A_Capacity['Bootspace'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Capacity[df_specs_A_Capacity['Bootspace']==''].shape[0],'\n',
       "space - ",df_specs_A_Capacity[df_specs_A_Capacity['Bootspace']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Capacity[df_specs_A_Capacity['Bootspace']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Capacity[df_specs_A_Capacity['Bootspace']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Capacity[df_specs_A_Capacity['Bootspace']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Capacity[df_specs_A_Capacity['Bootspace']== '0'].shape[0],'\n'
     )


null -  16042 
 nan -  16042 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



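* 16,042 of the 58,423 rows (about 27%) have no `Bootspace` value
* the remaining values are numeric strings, so the column is converted to `float64`, which keeps the missing entries as `NaN`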

In [311]:
df_specs_A_Capacity['Bootspace'].unique()

array([nan, '510', '400', '285', '420', '346', '256', '180', '320', '475',
       '407', '530', '284', '670', '243', '218', '352', '380', '235',
       '384', '480', '540', '378', '339', '690', '433', '190', '324',
       '556', '465', '205', '354', '615', '251', '494', '454', '358',
       '212', '460', '135', '300', '565', '506', '390', '295', '315',
       '421', '570', '175', '328', '680', '242', '595', '490', '345',
       '416', '560', '515', '625', '425', '311', '278', '363', '240',
       '359', '586', '341', '257', '215', '470', '236', '330', '438',
       '260', '353', '170', '280', '550', '981', '220', '392', '590',
       '605', '350', '448', '177', '633', '268', '265', '520', '432',
       '505', '673', '897', '296', '225', '209', '419', '493', '430',
       '222', '355', '347', '521', '630', '580', '623', '343', '270',
       '458', '373', '234', '483', '405', '455', '500', '447', '461',
       '435', '650', '1025', '476', '402', '513', '610', '313', '211',
       '279', 

In [312]:
df_specs_A_Capacity['Bootspace'] = df_specs_A_Capacity['Bootspace'].astype('float64')

In [313]:
df_specs_A_Capacity['Bootspace'].unique()

array([  nan,  510.,  400.,  285.,  420.,  346.,  256.,  180.,  320.,
        475.,  407.,  530.,  284.,  670.,  243.,  218.,  352.,  380.,
        235.,  384.,  480.,  540.,  378.,  339.,  690.,  433.,  190.,
        324.,  556.,  465.,  205.,  354.,  615.,  251.,  494.,  454.,
        358.,  212.,  460.,  135.,  300.,  565.,  506.,  390.,  295.,
        315.,  421.,  570.,  175.,  328.,  680.,  242.,  595.,  490.,
        345.,  416.,  560.,  515.,  625.,  425.,  311.,  278.,  363.,
        240.,  359.,  586.,  341.,  257.,  215.,  470.,  236.,  330.,
        438.,  260.,  353.,  170.,  280.,  550.,  981.,  220.,  392.,
        590.,  605.,  350.,  448.,  177.,  633.,  268.,  265.,  520.,
        432.,  505.,  673.,  897.,  296.,  225.,  209.,  419.,  493.,
        430.,  222.,  355.,  347.,  521.,  630.,  580.,  623.,  343.,
        270.,  458.,  373.,  234.,  483.,  405.,  455.,  500.,  447.,
        461.,  435.,  650., 1025.,  476.,  402.,  513.,  610.,  313.,
        211.,  279.,

In [314]:
df_specs_A_Capacity.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58423 entries, 0 to 58422
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   profileId           58423 non-null  object 
 1   vehicle             58423 non-null  object 
 2   Bootspace           42381 non-null  float64
 3   Doors               58415 non-null  object 
 4   Fuel Tank Capacity  57408 non-null  object 
 5   No of Seating Rows  52199 non-null  object 
 6   Seating Capacity    58422 non-null  object 
dtypes: float64(1), object(6)
memory usage: 3.1+ MB


##### `Doors`

In [315]:
df_specs_A_Capacity['Doors'] = df_specs_A_Capacity['Doors'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Capacity[df_specs_A_Capacity['Doors'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Capacity[df_specs_A_Capacity['Doors'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Capacity[df_specs_A_Capacity['Doors']==''].shape[0],'\n',
       "space - ",df_specs_A_Capacity[df_specs_A_Capacity['Doors']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Capacity[df_specs_A_Capacity['Doors']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Capacity[df_specs_A_Capacity['Doors']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Capacity[df_specs_A_Capacity['Doors']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Capacity[df_specs_A_Capacity['Doors']== '0'].shape[0],'\n'
     )


null -  8 
 nan -  8 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* only 8 rows are null

In [316]:
df_specs_A_Capacity['Doors'].unique()

array(['5', '4', '2', '3', nan], dtype=object)

In [317]:
df_specs_A_Capacity['Doors'] = df_specs_A_Capacity['Doors'].astype('Int64')

In [318]:
df_specs_A_Capacity['Doors'].unique()

<IntegerArray>
[5, 4, 2, 3, <NA>]
Length: 5, dtype: Int64

In [319]:
df_specs_A_Capacity['Doors'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 58423 entries, 0 to 58422
Series name: Doors
Non-Null Count  Dtype
--------------  -----
58415 non-null  Int64
dtypes: Int64(1)
memory usage: 513.6 KB
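
`Bootspace` was cast to `float64` while `Doors` uses the nullable `Int64` dtype. A small sketch of the difference on toy data follows; the series and variable names are illustrative. It goes through `pd.to_numeric` first, which is a slightly more defensive route than casting the strings directly as the notebook does.

```python
import numpy as np
import pandas as pd

# Toy column of numeric strings with one missing value.
raw = pd.Series(["5", "4", np.nan, "2"], dtype=object)

# float64: the missing value stays NaN and the counts become 5.0, 4.0, ...
as_float = pd.to_numeric(raw)

# Int64 (capital I): whole numbers are kept as integers and the
# missing value is stored as <NA>.
as_int = pd.to_numeric(raw).astype("Int64")

print(as_float.dtype, as_int.dtype)
```

For count-like columns such as doors, rows and seats, the nullable integer type avoids values like `5.0` while still allowing missing entries.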


##### `Fuel Tank Capacity`

In [320]:
df_specs_A_Capacity['Fuel Tank Capacity'] = df_specs_A_Capacity['Fuel Tank Capacity'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity']==''].shape[0],'\n',
       "space - ",df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Capacity[df_specs_A_Capacity['Fuel Tank Capacity']== '0'].shape[0],'\n'
     )


null -  1015 
 nan -  1015 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* 1,015 rows (under 2%) are null

In [321]:
df_specs_A_Capacity['Fuel Tank Capacity'].unique()

array(['35', '82', '40', '70', '56', '52', '43', '55', '42', '50', '80',
       '45', '85', '60', nan, '67.5', '28', '63', '37', '38', '57', '65',
       '71', '41', '90', '44', '75', '95', '77', '76', '48', '100', '96',
       '78', '66', '27', '64', '32', '36', '105', '54', '67', '58', '51',
       '92', '81', '15', '93', '61', '47', '87', '83', '73', '88', '104',
       '59', '62', '93.5', '68', '70.6', '30', '90.5', '24', '53', '52.5',
       '74', '26.2', '66.5', '60.9', '82.5', '72', '26', '138', '110',
       '86', '89', '88.5', '33', '62.4', '63.5', '71.5'], dtype=object)

In [322]:
df_specs_A_Capacity['Fuel Tank Capacity'] = df_specs_A_Capacity['Fuel Tank Capacity'].astype('Float64')

In [323]:
df_specs_A_Capacity['Fuel Tank Capacity'].unique()

<FloatingArray>
[ 35.0,  82.0,  40.0,  70.0,  56.0,  52.0,  43.0,  55.0,  42.0,  50.0,  80.0,
  45.0,  85.0,  60.0,  <NA>,  67.5,  28.0,  63.0,  37.0,  38.0,  57.0,  65.0,
  71.0,  41.0,  90.0,  44.0,  75.0,  95.0,  77.0,  76.0,  48.0, 100.0,  96.0,
  78.0,  66.0,  27.0,  64.0,  32.0,  36.0, 105.0,  54.0,  67.0,  58.0,  51.0,
  92.0,  81.0,  15.0,  93.0,  61.0,  47.0,  87.0,  83.0,  73.0,  88.0, 104.0,
  59.0,  62.0,  93.5,  68.0,  70.6,  30.0,  90.5,  24.0,  53.0,  52.5,  74.0,
  26.2,  66.5,  60.9,  82.5,  72.0,  26.0, 138.0, 110.0,  86.0,  89.0,  88.5,
  33.0,  62.4,  63.5,  71.5]
Length: 81, dtype: Float64

In [324]:
df_specs_A_Capacity['Fuel Tank Capacity'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 58423 entries, 0 to 58422
Series name: Fuel Tank Capacity
Non-Null Count  Dtype  
--------------  -----  
57408 non-null  Float64
dtypes: Float64(1)
memory usage: 513.6 KB


##### `No of Seating Rows`

In [325]:
df_specs_A_Capacity['No of Seating Rows'] = df_specs_A_Capacity['No of Seating Rows'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows']==''].shape[0],'\n',
       "space - ",df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Capacity[df_specs_A_Capacity['No of Seating Rows']== '0'].shape[0],'\n'
     )


null -  6224 
 nan -  6224 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



* 6,224 of the 58,423 rows (about 11%) are null

In [326]:
df_specs_A_Capacity['No of Seating Rows'].unique()

array(['2', '3', nan, '1', '4'], dtype=object)

In [327]:
df_specs_A_Capacity['No of Seating Rows'] = df_specs_A_Capacity['No of Seating Rows'].astype('Int64')

In [328]:
df_specs_A_Capacity['No of Seating Rows'].unique()

<IntegerArray>
[2, 3, <NA>, 1, 4]
Length: 5, dtype: Int64

In [329]:
df_specs_A_Capacity['No of Seating Rows'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 58423 entries, 0 to 58422
Series name: No of Seating Rows
Non-Null Count  Dtype
--------------  -----
52199 non-null  Int64
dtypes: Int64(1)
memory usage: 513.6 KB


##### `Seating Capacity`

In [330]:
df_specs_A_Capacity['Seating Capacity'] = df_specs_A_Capacity['Seating Capacity'].str.strip()

# checking inconsistencies and null values

print( "null - ", df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity'].isnull()].shape[0],'\n',
       "nan - ",df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity'].isna()].shape[0],'\n',
       "empty - ", df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity']==''].shape[0],'\n',
       "space - ",df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity']==' '].shape[0],'\n',
       "NA - ", df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity']=='Not Available'].shape[0],'\n',
       "None - ", df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity']== None].shape[0],'\n',
       "0 - ",  df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity']== 0].shape[0],'\n',
       "0(str) - ",  df_specs_A_Capacity[df_specs_A_Capacity['Seating Capacity']== '0'].shape[0],'\n'
     )


null -  1 
 nan -  1 
 empty -  0 
 space -  0 
 NA -  0 
 None -  0 
 0 -  0 
 0(str) -  0 



In [331]:
df_specs_A_Capacity['Seating Capacity'].unique()

array(['5', '4', '7', '8', '6', '7 & 9', '7 & 8', '2', '9', nan, '10'],
      dtype=object)

In [332]:
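# entries listing two seating options ('7 & 9', '7 & 8') are recorded with the lower value, 7
# assigning the result back avoids relying on an inplace replace on a selected column
df_specs_A_Capacity['Seating Capacity'] = df_specs_A_Capacity['Seating Capacity'].replace(['7 & 9','7 & 8'],['7','7'])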

In [333]:
df_specs_A_Capacity['Seating Capacity'].unique()

array(['5', '4', '7', '8', '6', '2', '9', nan, '10'], dtype=object)
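
The replacement above lists the two combined values by hand. If more such variants were to appear, one option is to keep the first number found in each entry with a regular expression. This is a sketch on toy data with illustrative names, and it assumes the first number is always the one to keep, which matches the choice of `7` made above.

```python
import numpy as np
import pandas as pd

# Toy values shaped like the 'Seating Capacity' column.
seats = pd.Series(["5", "7 & 9", "7 & 8", np.nan, "10"], dtype=object)

# Pull out the first run of digits; missing entries stay missing.
first_number = seats.str.extract(r"(\d+)", expand=False)

# Convert to a nullable integer column.
seats_int = pd.to_numeric(first_number).astype("Int64")
print(seats_int)
```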

In [334]:
df_specs_A_Capacity['Seating Capacity'] = df_specs_A_Capacity['Seating Capacity'].astype('Int64')

In [335]:
df_specs_A_Capacity['Seating Capacity'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 58423 entries, 0 to 58422
Series name: Seating Capacity
Non-Null Count  Dtype
--------------  -----
58422 non-null  Int64
dtypes: Int64(1)
memory usage: 513.6 KB


##### FINAL DATASET

In [336]:
df_specs_A_Capacity_2 = df_specs_A_Capacity.copy()

In [337]:
df_specs_A_Capacity.columns

Index(['profileId', 'vehicle', 'Bootspace', 'Doors', 'Fuel Tank Capacity',
       'No of Seating Rows', 'Seating Capacity'],
      dtype='object', name='')

In [338]:
df_specs_A_Capacity.rename(columns = {
                                         'Bootspace': 'Bootspace_(litres)',
                                         'Doors': 'Doors',
                                         'Fuel Tank Capacity': 'Fuel_Tank_Capacity_(litres)',
                                         'No of Seating Rows': 'Seating_Rows_(rows)',
                                         'Seating Capacity': 'Seating_Capacity_(persons)'
                                        },
                             inplace=True)

In [339]:
df_specs_A_Capacity.head(2)

Unnamed: 0,profileId,vehicle,Bootspace_(litres),Doors,Fuel_Tank_Capacity_(litres),Seating_Rows_(rows),Seating_Capacity_(persons)
0,D1820959,Hyundai i10 Sportz 1.2 AT Kappa2,,5,35.0,2,5
1,D1982769,Audi A8 L 50 TDI,510.0,4,82.0,2,4


In [340]:
df_specs_A_Capacity.to_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\specs_Capacity_A.csv',index=False)

## <div style = "padding: 20px; border-radius: 40px; background-color: #9FE2BF"> Dataset B - (using POLARS)</div>


### `Overview` DATA

In [341]:
df_overview_B = pl.read_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-I\overview_B.csv')

In [342]:
df_overview_B.shape

(31517, 57)

In [343]:
df_overview_B.describe()

describe,carId,profileId,vehicle,city,price,color,kilometers,fuelName,transmissionType,sellerId,insurance,insuranceExpiry,interiorColor,lifeTimeTax,insuranceLink,makeName,modelName,versionName,mainImageUrl,photoCount,makeYear,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,lastUpdatedDate,totalPhotosUploaded,similarCarsUrl,stockRecommendationUrl,cwBasePackageId,ctePackageId,modelId,versionId,registrationNumber,tcStockId,regType,priceNumeric,isCertified,oemVehicleUrl,videoUrl,modelMaskingName,rootName,bodyStyleId,isSellCarOfferAvailable,allowBooking,virtualPhoneNumber,emiPrice,isHomeTestDrive,homeTestDriveSlug,formattedRegistrationDate,fuelEconomy,dealershipLogoUrl,isSold,formattedOriginalPrice
str,f64,str,str,str,str,str,str,str,str,f64,str,str,str,str,str,str,str,str,str,f64,f64,str,str,str,f64,str,str,str,str,f64,str,str,f64,f64,f64,f64,str,f64,str,f64,f64,str,str,str,str,f64,f64,f64,f64,str,f64,str,str,str,str,f64,str
"""count""",31517.0,"""31517""","""31517""","""31517""","""31517""","""31517""","""31517""","""31517""","""31517""",31517.0,"""31517""","""31517""","""31517""","""31517""","""31517""","""31517""","""31517""","""31517""","""31517""",31517.0,31517.0,"""31517""","""31517""","""31517""",31517.0,"""31517""","""31517""","""31517""","""31517""",31517.0,"""31517""","""31517""",31517.0,31517.0,31517.0,31517.0,"""31517""",31517.0,"""31517""",31517.0,31517.0,"""31517""","""31517""","""31517""","""31517""",31517.0,31517.0,31517.0,31517.0,"""31517""",31517.0,"""31517""","""31517""","""31517""","""31517""",31517.0,"""31517"""
"""null_count""",0.0,"""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""",0.0,"""0""","""21207""","""31328""","""0""","""0""","""0""","""0""","""0""","""5314""",0.0,0.0,"""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""0""",0.0,"""0""","""0""",0.0,0.0,0.0,0.0,"""0""",0.0,"""22927""",0.0,0.0,"""31517""","""30964""","""0""","""0""",0.0,0.0,0.0,20305.0,"""0""",0.0,"""9873""","""0""","""0""","""30238""",0.0,"""30994"""
"""mean""",3569.307517,,,,,,,,,1.31326,,,,,,,,,,10.18755,2016.411587,,,,51.278263,,,,,10.189612,,,2.310943,89.872196,829.111622,4604.898436,,2119500.0,,1479200.0,0.0,,,,,4.189453,0.009455,0.00146,8160500000.0,,0.420027,,,,,0.0,
"""std""",2385.056231,,,,,,,,,0.463826,,,,,,,,,,6.589452,4.023541,,,,78.411382,,,,,6.587464,,,2.26793,61.352483,407.992133,2314.560717,,1433600.0,,2209600.0,0.0,,,,,2.794092,0.096779,0.038177,1109400000.0,,0.493571,,,,,0.0,
"""min""",0.0,"""D1982769""","""Aston Martin R…","""Bangalore""","""1 Crore""",""" Magma Grey""","""0""","""CNG""","""Automatic""",1.0,"""Comprehensive""","""01 Apr 2023""","""2023""","""Commercial Reg…","""/insurance/?ca…","""Aston Martin""","""1 Series""","""1.0 Kappa Magn…","""https://imgd-c…",0.0,1900.0,"""Apr""","""ahmedabad""","""Ahmedabad""",1.0,"""4 or More""","""23BH""",""" Outer Ring Ro…","""1 day(s) ago""",0.0,"""/api/stocks/D1…","""/api/stocks/D1…",0.0,0.0,1.0,3.0,""" """,0.0,"""Corporate""",40000.0,0.0,,"""https://www.yo…","""1-series""","""1-Series""",1.0,0.0,0.0,6366800000.0,"""1 L""",0.0,"""{'titlePrefix'…","""Apr 2008""","""10 kpl""","""https://imgd.a…",0.0,"""Rs. 1.09 Crore…"
"""25%""",1577.0,,,,,,,,,1.0,,,,,,,,,,5.0,2014.0,,,,2.0,,,,,5.0,,,0.0,0.0,540.0,2826.0,,0.0,,460000.0,,,,,,1.0,,,7045800000.0,,,,,,,,
"""50%""",3208.0,,,,,,,,,1.0,,,,,,,,,,11.0,2017.0,,,,10.0,,,,,11.0,,,2.0,123.0,862.0,4513.0,,3082172.0,,760000.0,,,,,,3.0,,,7303400000.0,,,,,,,,
"""75%""",5351.0,,,,,,,,,2.0,,,,,,,,,,15.0,2019.0,,,,105.0,,,,,15.0,,,3.0,127.0,1136.0,6002.0,,3123277.0,,1575000.0,,,,,,6.0,,,9176400000.0,,,,,,,,
"""max""",9450.0,"""S2805517""","""Volvo XC90 Mom…","""Mumbai""","""99.99 Lakh""","""Zanskar Blue""","""999""","""Petrol + Petro…","""Manual""",2.0,"""ThirdParty""","""31 Oct 2023""","""White ""","""Taxi""","""/insurance/?ca…","""Volvo""","""redi-GO [2016-…","""xDrive40i Spor…","""https://imgd.a…",46.0,2023.0,"""Sep""","""udupi""","""Udupi""",1281.0,"""UnRegistered C…","""west Delhi""","""kaikondrahalli…","""9 month(s) ago…",46.0,"""/api/stocks/S2…","""/api/stocks/S2…",7.0,148.0,2553.0,14999.0,"""wb02as2431""",3145396.0,"""Taxi""",54500000.0,0.0,,"""https://www.yo…","""zs-ev-2020-202…","""iX""",11.0,1.0,1.0,9963000000.0,"""996""",1.0,"""{'titlePrefix'…","""Sep 2031""","""Not Available""","""https://imgd.a…",0.0,"""Rs. 98.75 Lakh…"


In [344]:
df_overview_B.head(2)

carId,profileId,vehicle,city,price,color,kilometers,fuelName,transmissionType,sellerId,insurance,insuranceExpiry,interiorColor,lifeTimeTax,insuranceLink,makeName,modelName,versionName,mainImageUrl,photoCount,makeYear,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,lastUpdatedDate,totalPhotosUploaded,similarCarsUrl,stockRecommendationUrl,cwBasePackageId,ctePackageId,modelId,versionId,registrationNumber,tcStockId,regType,priceNumeric,isCertified,oemVehicleUrl,videoUrl,modelMaskingName,rootName,bodyStyleId,isSellCarOfferAvailable,allowBooking,virtualPhoneNumber,emiPrice,isHomeTestDrive,homeTestDriveSlug,formattedRegistrationDate,fuelEconomy,dealershipLogoUrl,isSold,formattedOriginalPrice
i64,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,str,str,str,str,i64,i64,str,str,str,i64,str,str,str,str,i64,str,str,i64,i64,i64,i64,str,i64,str,i64,bool,str,str,str,str,i64,bool,bool,f64,str,bool,str,str,str,str,bool,str
0,"""D4135089""","""Tata Harrier X…","""Bangalore""","""18.95 Lakh""","""Black""","""39,420""","""Diesel""","""Manual""",1,"""Comprehensive""","""24 Apr 2024""",,"""Individual""","""/insurance/?ca…","""Tata""","""Harrier [2019-…","""XZ Dark Editio…","""https://imgd-c…",22,2021,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Singasandra, B…","""14 day(s) ago""",22,"""/api/stocks/D4…","""/api/stocks/D4…",6,146,1162,9097,"""KA51MQ9123""",3123154,,1895000,False,,"""https://www.yo…","""harrier-2019-2…","""Harrier""",6,False,False,7303400000.0,"""31,469""",False,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
1,"""D4170587""","""BMW 3 Series G…","""Bangalore""","""34.95 Lakh""","""Grey""","""38,703""","""Diesel""","""Automatic""",1,"""Comprehensive""","""16 Jul 2024""",,"""Individual""","""/insurance/?ca…","""BMW""","""3 Series GT [2…","""320d Luxury Li…","""https://imgd.a…",22,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Singasandra, B…","""22 hour(s) ago…",22,"""/api/stocks/D4…","""/api/stocks/D4…",6,146,802,3237,"""KA51MK9891""",3142670,,3495000,False,,"""https://www.yo…","""3-series-gt-20…","""3-Series""",1,False,False,7303400000.0,"""58,040""",False,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,


#### REMOVING DUPLICATES

In [345]:
# keeping a clone of the dataframe as loaded, as a restore point
# before duplicates are removed

df_overview_B_1 = df_overview_B.clone()

* `profileId` is meant to be the unique id for each car record
* as shown below, some ids are repeated: the 31,517 rows contain only 30,274 distinct `profileId` values

In [346]:
df_overview_B['profileId'].unique().shape

(30274,)

In [347]:
df_overview_B.group_by('profileId').count().filter(pl.col('count')>1).sort(by = 'count', descending = True)

profileId,count
str,u32
"""S2679811""",13
"""S2714391""",11
"""S2787361""",11
"""D4091389""",10
"""S2687517""",9
"""S2660531""",9
"""S2781695""",8
"""D4038483""",7
"""S2684173""",7
"""D3876199""",7


In [348]:
df_overview_B.filter(pl.col('profileId') == 'D4038483')

carId,profileId,vehicle,city,price,color,kilometers,fuelName,transmissionType,sellerId,insurance,insuranceExpiry,interiorColor,lifeTimeTax,insuranceLink,makeName,modelName,versionName,mainImageUrl,photoCount,makeYear,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,lastUpdatedDate,totalPhotosUploaded,similarCarsUrl,stockRecommendationUrl,cwBasePackageId,ctePackageId,modelId,versionId,registrationNumber,tcStockId,regType,priceNumeric,isCertified,oemVehicleUrl,videoUrl,modelMaskingName,rootName,bodyStyleId,isSellCarOfferAvailable,allowBooking,virtualPhoneNumber,emiPrice,isHomeTestDrive,homeTestDriveSlug,formattedRegistrationDate,fuelEconomy,dealershipLogoUrl,isSold,formattedOriginalPrice
i64,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,str,str,str,str,i64,i64,str,str,str,i64,str,str,str,str,i64,str,str,i64,i64,i64,i64,str,i64,str,i64,bool,str,str,str,str,i64,bool,bool,f64,str,bool,str,str,str,str,bool,str
6830,"""D4038483""","""Ford Endeavour…","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
6831,"""D4038483""","""Fiat Linea Emo…","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
6832,"""D4038483""","""Maruti Suzuki …","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
6833,"""D4038483""","""Toyota Corolla…","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
6834,"""D4038483""","""Honda City 1.5…","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
6835,"""D4038483""","""Toyota Corolla…","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
6836,"""D4038483""","""Kia Seltos GTX…","""Bangalore""","""27.5 Lakh""","""White""","""1,05,000""","""Diesel""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Ford""","""Endeavour [201…","""Titanium 3.2 4…","""https://imgd-c…",18,2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Arunachalam Mu…","""1 month(s) ago…",18,"""/api/stocks/D4…","""/api/stocks/D4…",2,127,956,4447,"""KA631234""",3069549,,2750000,False,,,"""endeavour-2016…","""Endeavour""",6,False,False,7045800000.0,"""45,668""",True,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,


In [349]:
# removing duplicate records w.r.t profileId
df_overview_B = df_overview_B.unique(subset = ['profileId'],keep = 'first',maintain_order = True).clone()

In [350]:
df_overview_B.describe()

describe,carId,profileId,vehicle,city,price,color,kilometers,fuelName,transmissionType,sellerId,insurance,insuranceExpiry,interiorColor,lifeTimeTax,insuranceLink,makeName,modelName,versionName,mainImageUrl,photoCount,makeYear,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,lastUpdatedDate,totalPhotosUploaded,similarCarsUrl,stockRecommendationUrl,cwBasePackageId,ctePackageId,modelId,versionId,registrationNumber,tcStockId,regType,priceNumeric,isCertified,oemVehicleUrl,videoUrl,modelMaskingName,rootName,bodyStyleId,isSellCarOfferAvailable,allowBooking,virtualPhoneNumber,emiPrice,isHomeTestDrive,homeTestDriveSlug,formattedRegistrationDate,fuelEconomy,dealershipLogoUrl,isSold,formattedOriginalPrice
str,f64,str,str,str,str,str,str,str,str,f64,str,str,str,str,str,str,str,str,str,f64,f64,str,str,str,f64,str,str,str,str,f64,str,str,f64,f64,f64,f64,str,f64,str,f64,f64,str,str,str,str,f64,f64,f64,f64,str,f64,str,str,str,str,f64,str
"""count""",30274.0,"""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""",30274.0,30274.0,"""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""",30274.0,30274.0,30274.0,30274.0,"""30274""",30274.0,"""30274""",30274.0,30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,30274.0,30274.0,30274.0,"""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274"""
"""null_count""",0.0,"""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""",0.0,"""0""","""20362""","""30093""","""0""","""0""","""0""","""0""","""0""","""5088""",0.0,0.0,"""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""0""",0.0,"""0""","""0""",0.0,0.0,0.0,0.0,"""0""",0.0,"""22068""",0.0,0.0,"""30274""","""29736""","""0""","""0""",0.0,0.0,0.0,19558.0,"""0""",0.0,"""9439""","""0""","""0""","""29036""",0.0,"""29781"""
"""mean""",3550.432054,,,,,,,,,1.311786,,,,,,,,,,10.197463,2016.437075,,,,49.271685,,,,,10.199412,,,2.322455,90.082018,830.26389,4613.340556,,2125100.0,,1436900.0,0.0,,,,,4.199115,0.009744,0.001519,8176500000.0,,0.420856,,,,,0.0,
"""std""",2389.923857,,,,,,,,,0.46323,,,,,,,,,,6.55814,3.973605,,,,76.364295,,,,,6.556245,,,2.281108,61.292496,408.190728,2318.909168,,1432400.0,,2134200.0,0.0,,,,,2.791714,0.098233,0.038951,1109800000.0,,0.493705,,,,,0.0,
"""min""",0.0,"""D1982769""","""Aston Martin R…","""Bangalore""","""1 Crore""",""" Magma Grey""","""0""","""CNG""","""Automatic""",1.0,"""Comprehensive""","""01 Apr 2023""","""2023""","""Commercial Reg…","""/insurance/?ca…","""Aston Martin""","""1 Series""","""1.0 Kappa Magn…","""https://imgd-c…",0.0,1900.0,"""Apr""","""ahmedabad""","""Ahmedabad""",1.0,"""4 or More""","""23BH""",""" Outer Ring Ro…","""1 day(s) ago""",0.0,"""/api/stocks/D1…","""/api/stocks/D1…",0.0,0.0,1.0,3.0,""" """,0.0,"""Corporate""",40000.0,0.0,,"""https://www.yo…","""1-series""","""1-Series""",1.0,0.0,0.0,6366800000.0,"""1 L""",0.0,"""{'titlePrefix'…","""Apr 2008""","""10 kpl""","""https://imgd.a…",0.0,"""Rs. 1.09 Crore…"
"""25%""",1530.0,,,,,,,,,1.0,,,,,,,,,,5.0,2014.0,,,,2.0,,,,,5.0,,,0.0,0.0,542.0,2826.0,,0.0,,453775.0,,,,,,2.0,,,7045800000.0,,,,,,,,
"""50%""",3187.0,,,,,,,,,1.0,,,,,,,,,,11.0,2017.0,,,,10.0,,,,,11.0,,,2.0,123.0,862.0,4520.0,,3084389.0,,750000.0,,,,,,3.0,,,7303400000.0,,,,,,,,
"""75%""",5344.0,,,,,,,,,2.0,,,,,,,,,,15.0,2019.0,,,,105.0,,,,,15.0,,,3.0,127.0,1138.0,6017.0,,3123705.0,,1525000.0,,,,,,6.0,,,9176400000.0,,,,,,,,
"""max""",9450.0,"""S2805517""","""Volvo XC90 Mom…","""Mumbai""","""99.99 Lakh""","""Zanskar Blue""","""999""","""Petrol + Petro…","""Manual""",2.0,"""ThirdParty""","""31 Oct 2023""","""White ""","""Taxi""","""/insurance/?ca…","""Volvo""","""redi-GO [2016-…","""xDrive40i Spor…","""https://imgd.a…",46.0,2023.0,"""Sep""","""udupi""","""Udupi""",1281.0,"""UnRegistered C…","""west Delhi""","""kaikondrahalli…","""9 month(s) ago…",46.0,"""/api/stocks/S2…","""/api/stocks/S2…",7.0,148.0,2553.0,14999.0,"""wb02as2431""",3145396.0,"""Taxi""",54500000.0,0.0,,"""https://www.yo…","""zs-ev-2020-202…","""iX""",11.0,1.0,1.0,9963000000.0,"""996""",1.0,"""{'titlePrefix'…","""Sep 2031""","""Not Available""","""https://imgd.a…",0.0,"""Rs. 98.75 Lakh…"


#### COLUMN-WISE UNDERSTANDING OF DATA AND CLEANING

In [351]:
# taking a clone after removing duplicates
df_overview_B_2 = df_overview_B.clone()

In [352]:
# checking the kind of data in each column
df_overview_B.head()[:,0:20]

carId,profileId,vehicle,city,price,color,kilometers,fuelName,transmissionType,sellerId,insurance,insuranceExpiry,interiorColor,lifeTimeTax,insuranceLink,makeName,modelName,versionName,mainImageUrl,photoCount
i64,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,str,str,str,str,i64
0,"""D4135089""","""Tata Harrier X…","""Bangalore""","""18.95 Lakh""","""Black""","""39,420""","""Diesel""","""Manual""",1,"""Comprehensive""","""24 Apr 2024""",,"""Individual""","""/insurance/?ca…","""Tata""","""Harrier [2019-…","""XZ Dark Editio…","""https://imgd-c…",22
1,"""D4170587""","""BMW 3 Series G…","""Bangalore""","""34.95 Lakh""","""Grey""","""38,703""","""Diesel""","""Automatic""",1,"""Comprehensive""","""16 Jul 2024""",,"""Individual""","""/insurance/?ca…","""BMW""","""3 Series GT [2…","""320d Luxury Li…","""https://imgd.a…",22
2,"""D4170559""","""Jeep Compass L…","""Bangalore""","""14.45 Lakh""","""Silver""","""49,209""","""Diesel""","""Manual""",1,"""Comprehensive""","""17 Aug 2024""",,"""Individual""","""/insurance/?ca…","""Jeep""","""Compass [2017-…","""Limited 2.0 Di…","""https://imgd-c…",20
3,"""D4112409""","""Mercedes-Benz …","""Bangalore""","""56.76 Lakh""","""Grey""","""36,464""","""Diesel""","""Automatic (TC)…",1,"""Comprehensive""","""20 Jun 2024""",,"""Individual""","""/insurance/?ca…","""Mercedes-Benz""","""C-Class""","""C 220d""","""https://imgd.a…",22
4,"""D4087571""","""Jeep Compass L…","""Bangalore""","""17.45 Lakh""","""Grey""","""65,000""","""Petrol""","""Automatic""",1,"""Comprehensive""",,,"""Individual""","""/insurance/?ca…","""Jeep""","""Compass [2017-…","""Limited Plus P…","""https://imgd.a…",22


In [353]:
# checking the kind of data in each column
df_overview_B.head()[:,20:40]

makeYear,makeMonth,cityMaskingName,cityName,cityId,noOfOwners,registerCity,carAvailbaleAt,lastUpdatedDate,totalPhotosUploaded,similarCarsUrl,stockRecommendationUrl,cwBasePackageId,ctePackageId,modelId,versionId,registrationNumber,tcStockId,regType,priceNumeric
i64,str,str,str,i64,str,str,str,str,i64,str,str,i64,i64,i64,i64,str,i64,str,i64
2021,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Singasandra, B…","""14 day(s) ago""",22,"""/api/stocks/D4…","""/api/stocks/D4…",6,146,1162,9097,"""KA51MQ9123""",3123154,,1895000
2017,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Singasandra, B…","""22 hour(s) ago…",22,"""/api/stocks/D4…","""/api/stocks/D4…",6,146,802,3237,"""KA51MK9891""",3142670,,3495000
2017,"""Jun""","""bangalore""","""Bangalore""",2,"""Second""","""Not Available""","""Singasandra, B…","""22 hour(s) ago…",20,"""/api/stocks/D4…","""/api/stocks/D4…",6,146,1048,5367,"""KA03NB0575""",3142656,,1445000
2022,"""May""","""bangalore""","""Bangalore""",2,"""First""","""Bangalore ""","""BTM Layout, Ba…","""23 day(s) ago""",22,"""/api/stocks/D4…","""/api/stocks/D4…",6,146,1845,10133,"""KA28MB4777""",3110602,,5676000
2019,"""Jun""","""bangalore""","""Bangalore""",2,"""First""","""Not Available""","""Adugodi, Banga…","""1 month(s) ago…",22,"""/api/stocks/D4…","""/api/stocks/D4…",4,144,1048,5932,"""KA01MT7980""",3096618,,1745000


In [354]:
# checking the kind of data in each column
df_overview_B.head()[:,40:60]

isCertified,oemVehicleUrl,videoUrl,modelMaskingName,rootName,bodyStyleId,isSellCarOfferAvailable,allowBooking,virtualPhoneNumber,emiPrice,isHomeTestDrive,homeTestDriveSlug,formattedRegistrationDate,fuelEconomy,dealershipLogoUrl,isSold,formattedOriginalPrice
bool,str,str,str,str,i64,bool,bool,f64,str,bool,str,str,str,str,bool,str
False,,"""https://www.yo…","""harrier-2019-2…","""Harrier""",6,False,False,7303400000.0,"""31,469""",False,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
False,,"""https://www.yo…","""3-series-gt-20…","""3-Series""",1,False,False,7303400000.0,"""58,040""",False,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
False,,"""https://www.yo…","""compass-2017-2…","""Compass""",6,False,False,7303400000.0,"""23,996""",False,"""{'titlePrefix'…","""Not Available""","""Not Available""",,False,
False,,,"""c-class""","""C-Class""",1,False,False,7045800000.0,"""94,259""",False,"""{'titlePrefix'…","""Jul 2022""","""Not Available""",,False,
False,,,"""compass-2017-2…","""Compass""",6,False,False,9311600000.0,"""28,978""",False,"""{'titlePrefix'…","""Not Available""","""Not Available""","""https://imgd.a…",False,


* after a preliminary assessment, the columns below do not appear useful for the analysis and are dropped
* **"carId","color","sellerId","insuranceExpiry","interiorColor","lifeTimeTax","insuranceLink","mainImageUrl","photoCount","cityId","lastUpdatedDate","totalPhotosUploaded","similarCarsUrl","stockRecommendationUrl","cwBasePackageId","ctePackageId","modelId","versionId","tcStockId","isCertified","oemVehicleUrl","videoUrl","modelMaskingName","bodyStyleId","isSellCarOfferAvailable","allowBooking","virtualPhoneNumber","emiPrice","isHomeTestDrive","homeTestDriveSlug","formattedRegistrationDate","fuelEconomy","dealershipLogoUrl","isSold","formattedOriginalPrice"**


In [355]:
df_overview_B = df_overview_B.drop(["carId","color","sellerId","insuranceExpiry","interiorColor","lifeTimeTax",
                                    "insuranceLink","mainImageUrl","photoCount","cityId","lastUpdatedDate",
                                    "totalPhotosUploaded","similarCarsUrl","stockRecommendationUrl",
                                    "cwBasePackageId","ctePackageId","modelId","versionId","tcStockId",
                                    "isCertified","oemVehicleUrl","videoUrl","modelMaskingName","bodyStyleId",
                                    "isSellCarOfferAvailable","allowBooking","virtualPhoneNumber","emiPrice",
                                    "isHomeTestDrive","homeTestDriveSlug","formattedRegistrationDate","fuelEconomy",
                                    "dealershipLogoUrl","isSold","formattedOriginalPrice"]
                                  )

In [356]:
df_overview_B.head(2)

profileId,vehicle,city,price,kilometers,fuelName,transmissionType,insurance,makeName,modelName,versionName,makeYear,makeMonth,cityMaskingName,cityName,noOfOwners,registerCity,carAvailbaleAt,registrationNumber,regType,priceNumeric,rootName
str,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,str,str,str,i64,str
"""D4135089""","""Tata Harrier X…","""Bangalore""","""18.95 Lakh""","""39,420""","""Diesel""","""Manual""","""Comprehensive""","""Tata""","""Harrier [2019-…","""XZ Dark Editio…",2021,"""Jun""","""bangalore""","""Bangalore""","""First""","""Not Available""","""Singasandra, B…","""KA51MQ9123""",,1895000,"""Harrier"""
"""D4170587""","""BMW 3 Series G…","""Bangalore""","""34.95 Lakh""","""38,703""","""Diesel""","""Automatic""","""Comprehensive""","""BMW""","""3 Series GT [2…","""320d Luxury Li…",2017,"""Jun""","""bangalore""","""Bangalore""","""First""","""Not Available""","""Singasandra, B…","""KA51MK9891""",,3495000,"""3-Series"""


In [357]:
df_overview_B.describe()

describe,profileId,vehicle,city,price,kilometers,fuelName,transmissionType,insurance,makeName,modelName,versionName,makeYear,makeMonth,cityMaskingName,cityName,noOfOwners,registerCity,carAvailbaleAt,registrationNumber,regType,priceNumeric,rootName
str,str,str,str,str,str,str,str,str,str,str,str,f64,str,str,str,str,str,str,str,str,f64,str
"""count""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""",30274.0,"""30274"""
"""null_count""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""0""","""0""","""0""","""0""","""22068""",0.0,"""0"""
"""mean""",,,,,,,,,,,,2016.437075,,,,,,,,,1436900.0,
"""std""",,,,,,,,,,,,3.973605,,,,,,,,,2134200.0,
"""min""","""D1982769""","""Aston Martin R…","""Bangalore""","""1 Crore""","""0""","""CNG""","""Automatic""","""Comprehensive""","""Aston Martin""","""1 Series""","""1.0 Kappa Magn…",1900.0,"""Apr""","""ahmedabad""","""Ahmedabad""","""4 or More""","""23BH""",""" Outer Ring Ro…",""" ""","""Corporate""",40000.0,"""1-Series"""
"""25%""",,,,,,,,,,,,2014.0,,,,,,,,,453775.0,
"""50%""",,,,,,,,,,,,2017.0,,,,,,,,,750000.0,
"""75%""",,,,,,,,,,,,2019.0,,,,,,,,,1525000.0,
"""max""","""S2805517""","""Volvo XC90 Mom…","""Mumbai""","""99.99 Lakh""","""999""","""Petrol + Petro…","""Manual""","""ThirdParty""","""Volvo""","""redi-GO [2016-…","""xDrive40i Spor…",2023.0,"""Sep""","""udupi""","""Udupi""","""UnRegistered C…","""west Delhi""","""kaikondrahalli…","""wb02as2431""","""Taxi""",54500000.0,"""iX"""


* analysing and cleaning each column

##### `profileId`

In [358]:
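# first clearing out the leading and trailing spaces
# (the stripped result has to be assigned back, otherwise it is discarded)

df_overview_B = df_overview_B.with_columns(pl.col('profileId').str.strip_chars())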


# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('profileId').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('profileId')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('profileId')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('profileId')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('profileId')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('profileId').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('profileId')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 



In [359]:
# checking profileId for any irregularities in the values
# every profileId starts with either S or D, so the data here is consistent

set(df_overview_B['profileId'].str.slice(0,2))

{'D1', 'D2', 'D3', 'D4', 'S2'}

##### `vehicle`

In [360]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('vehicle').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('vehicle').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('vehicle')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('vehicle')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('vehicle')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('vehicle')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('vehicle').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('vehicle')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 



In [361]:
# data is consistent here as well
# will not be specifically used for analysis, but is kept to retain the original description as shown in the website

set(df_overview_B['vehicle'].str.split(' ').list[0])

{'Aston',
 'Audi',
 'BMW',
 'Bentley',
 'Chevrolet',
 'Chrysler',
 'Citroen',
 'Datsun',
 'Fiat',
 'Force',
 'Ford',
 'Hindustan',
 'Honda',
 'Hummer',
 'Hyundai',
 'Isuzu',
 'Jaguar',
 'Jeep',
 'Kia',
 'Lamborghini',
 'Land',
 'Lexus',
 'MG',
 'MINI',
 'Mahindra',
 'Mahindra-Renault',
 'Maruti',
 'Maserati',
 'Mercedes-Benz',
 'Mitsubishi',
 'Nissan',
 'Opel',
 'Porsche',
 'Renault',
 'Rolls-Royce',
 'Skoda',
 'Ssangyong',
 'Tata',
 'Toyota',
 'Volkswagen',
 'Volvo'}

##### `cityName`, `city`, `state`, `cityMaskingName`, and `carAvailbaleAt`

In [362]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('city').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('city').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('city')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('city')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('city')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('city')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('city').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('city')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 



In [363]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('cityName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('cityName').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('cityName')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('cityName')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('cityName')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('cityName')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('cityName').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('cityName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 



In [364]:
df_overview_B.filter(pl.col('city') != pl.col('cityName')).shape[0]

2536

In [365]:
df_overview_B.filter(pl.col('city') == pl.col('cityName')).shape[0]

27738

* 2,536 records have a `city` value that differs from `cityName`
* `cityName` is treated as the correct city, since it was embedded in the JSON extracted from the source
* any Not Applicable values in `cityName` can be filled using `cityMaskingName` or `carAvailbaleAt`, as all 3 fields hold similar values
* Hence we will be using the `cityName` column instead of `city`

In [366]:
set(df_overview_B['cityName'].unique())

{'Ahmedabad',
 'Badlapur',
 'Bangalore',
 'Chandigarh',
 'Chennai',
 'Coimbatore',
 'Dak. Kannada',
 'Dehradun',
 'Delhi',
 'Faridabad',
 'Ghaziabad',
 'Gurgaon',
 'Hyderabad',
 'Karnal',
 'Lucknow',
 'Ludhiana',
 'Madurai',
 'Mangalore',
 'Meerut',
 'Mohali',
 'Mumbai',
 'Nashik',
 'Navi Mumbai',
 'Noida',
 'Pune',
 'Thane',
 'Udupi'}

* the `cityName` data is consistent, so the other city-related columns are not required
* state data is not available in this dataset and has to be added.

In [367]:
# loading the city-to-state lookup (state values obtained from ChatGPT)
city_state_update = pl.read_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\01.Data Extraction\city_state_missing_2.csv')

In [368]:
city_state_update.head(2)

cityName,State
str,str
"""Ahmedabad""","""Gujarat"""
"""Badlapur""","""Maharashtra"""


In [369]:
# joining the state data on main dataframe
df_overview_B = df_overview_B.join(other= city_state_update,on = 'cityName', how = 'left')


In [370]:
df_overview_B.shape

(30274, 23)

In [371]:
df_overview_B.columns

['profileId',
 'vehicle',
 'city',
 'price',
 'kilometers',
 'fuelName',
 'transmissionType',
 'insurance',
 'makeName',
 'modelName',
 'versionName',
 'makeYear',
 'makeMonth',
 'cityMaskingName',
 'cityName',
 'noOfOwners',
 'registerCity',
 'carAvailbaleAt',
 'registrationNumber',
 'regType',
 'priceNumeric',
 'rootName',
 'State']

In [372]:
# dropping the remaining city-related columns and registrationNumber

df_overview_B = df_overview_B.drop(['city','cityMaskingName','carAvailbaleAt','registerCity','registrationNumber'])

In [373]:
df_overview_B.describe()

describe,profileId,vehicle,price,kilometers,fuelName,transmissionType,insurance,makeName,modelName,versionName,makeYear,makeMonth,cityName,noOfOwners,regType,priceNumeric,rootName,State
str,str,str,str,str,str,str,str,str,str,str,f64,str,str,str,str,f64,str,str
"""count""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274"""
"""null_count""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""22068""",0.0,"""0""","""0"""
"""mean""",,,,,,,,,,,2016.437075,,,,,1436900.0,,
"""std""",,,,,,,,,,,3.973605,,,,,2134200.0,,
"""min""","""D1982769""","""Aston Martin R…","""1 Crore""","""0""","""CNG""","""Automatic""","""Comprehensive""","""Aston Martin""","""1 Series""","""1.0 Kappa Magn…",1900.0,"""Apr""","""Ahmedabad""","""4 or More""","""Corporate""",40000.0,"""1-Series""","""Chandigarh"""
"""25%""",,,,,,,,,,,2014.0,,,,,453775.0,,
"""50%""",,,,,,,,,,,2017.0,,,,,750000.0,,
"""75%""",,,,,,,,,,,2019.0,,,,,1525000.0,,
"""max""","""S2805517""","""Volvo XC90 Mom…","""99.99 Lakh""","""999""","""Petrol + Petro…","""Manual""","""ThirdParty""","""Volvo""","""redi-GO [2016-…","""xDrive40i Spor…",2023.0,"""Sep""","""Udupi""","""UnRegistered C…","""Taxi""",54500000.0,"""iX""","""Uttarakhand"""


##### `kilometers`

In [374]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('kilometers').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('kilometers').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('kilometers')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('kilometers')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('kilometers')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('kilometers')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('kilometers').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('kilometers')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 5 
 None 0 



* 5 records have '0' as the `kilometers` value

In [375]:
df_overview_B['kilometers'].head()

kilometers
str
"""39,420"""
"""38,703"""
"""49,209"""
"""36,464"""
"""65,000"""
"""38,590"""
"""12,400"""
"""75,600"""
"""36,893"""
"""83,000"""


In [376]:
df_overview_B = df_overview_B.with_columns(pl.col('kilometers').str.replace_all(',',''))

In [377]:
# changing the datatype to float
df_overview_B = df_overview_B.with_columns(pl.col('kilometers').cast(pl.Float64))

In [378]:
df_overview_B.head()

profileId,vehicle,price,kilometers,fuelName,transmissionType,insurance,makeName,modelName,versionName,makeYear,makeMonth,cityName,noOfOwners,regType,priceNumeric,rootName,State
str,str,str,f64,str,str,str,str,str,str,i64,str,str,str,str,i64,str,str
"""D4135089""","""Tata Harrier X…","""18.95 Lakh""",39420.0,"""Diesel""","""Manual""","""Comprehensive""","""Tata""","""Harrier [2019-…","""XZ Dark Editio…",2021,"""Jun""","""Bangalore""","""First""",,1895000,"""Harrier""","""Karnataka"""
"""D4170587""","""BMW 3 Series G…","""34.95 Lakh""",38703.0,"""Diesel""","""Automatic""","""Comprehensive""","""BMW""","""3 Series GT [2…","""320d Luxury Li…",2017,"""Jun""","""Bangalore""","""First""",,3495000,"""3-Series""","""Karnataka"""
"""D4170559""","""Jeep Compass L…","""14.45 Lakh""",49209.0,"""Diesel""","""Manual""","""Comprehensive""","""Jeep""","""Compass [2017-…","""Limited 2.0 Di…",2017,"""Jun""","""Bangalore""","""Second""",,1445000,"""Compass""","""Karnataka"""
"""D4112409""","""Mercedes-Benz …","""56.76 Lakh""",36464.0,"""Diesel""","""Automatic (TC)…","""Comprehensive""","""Mercedes-Benz""","""C-Class""","""C 220d""",2022,"""May""","""Bangalore""","""First""",,5676000,"""C-Class""","""Karnataka"""
"""D4087571""","""Jeep Compass L…","""17.45 Lakh""",65000.0,"""Petrol""","""Automatic""","""Comprehensive""","""Jeep""","""Compass [2017-…","""Limited Plus P…",2019,"""Jun""","""Bangalore""","""First""",,1745000,"""Compass""","""Karnataka"""


In [379]:
# 0 kilometers
df_overview_B.filter(pl.col('kilometers')==0).shape

(5, 18)

##### `fuelName`

In [380]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('fuelName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('fuelName').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('fuelName')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('fuelName')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('fuelName')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('fuelName')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('fuelName').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('fuelName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 



In [381]:
set(df_overview_B['fuelName'].unique())

{'CNG',
 'CNG + CNG',
 'CNG + LPG',
 'CNG + Petrol',
 'CNG + Petrol+Cng',
 'Diesel',
 'Diesel + CNG',
 'Diesel + Diesel',
 'Diesel + LPG',
 'Diesel + Petrol',
 'Electric',
 'Electric + Electric',
 'Electric + LPG',
 'Hybrid',
 'Hybrid + Hybrid(Ele',
 'Hybrid + Petrol',
 'LPG',
 'LPG + CNG',
 'LPG + LPG',
 'LPG + Petrol',
 'LPG + Petrol+Lpg',
 'Mild Hybrid(Electric + Petrol)',
 'Mild Hybrid(Electric + Petrol) + Hybrid(Ele',
 'Petrol',
 'Petrol + CNG',
 'Petrol + Diesel',
 'Petrol + Hybrid(Ele',
 'Petrol + LPG',
 'Petrol + Petrol',
 'Petrol + Petrol+Cng'}

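* fuel names are inconsistent (repeated and truncated labels such as 'Petrol + Petrol' and 'Hybrid + Hybrid(Ele') and can be cross-checked against the fuel field in the specifications data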

##### `transmissionType`

In [382]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('transmissionType').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('transmissionType').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('transmissionType')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('transmissionType')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('transmissionType')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('transmissionType')==None).shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('transmissionType').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('transmissionType')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 



In [383]:
list(df_overview_B['transmissionType'].unique())

['Automatic (EV/Hybrid)',
 'Automatic (TC)',
 'Automatic (DCT)',
 'Manual',
 'Automatic (CVT)',
 'Automatic (AMT)',
 'Clutchless Manual (IMT)',
 'Automatic (e-CVT)',
 'Automatic']

* grouping all the transmissions into 2 basic categories ('Clutchless Manual (IMT)' is counted as Automatic)
    * Automatic
    * Manual

In [384]:
df_overview_B = df_overview_B.with_columns(pl.col('transmissionType').str.split(' ').list[0].str.replace('Clutchless','Automatic').alias('transmission'))

In [385]:
df_overview_B['transmission'].unique()

transmission
str
"""Automatic"""
"""Manual"""


In [386]:
#dropping Transmission type column
df_overview_B = df_overview_B.drop('transmissionType')

In [387]:
df_overview_B.columns

['profileId',
 'vehicle',
 'price',
 'kilometers',
 'fuelName',
 'insurance',
 'makeName',
 'modelName',
 'versionName',
 'makeYear',
 'makeMonth',
 'cityName',
 'noOfOwners',
 'regType',
 'priceNumeric',
 'rootName',
 'State',
 'transmission']

##### `insurance`

In [388]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('insurance').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('insurance').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('insurance')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('insurance')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('insurance')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('insurance')==None).shape[0],'\n',
        'NA',df_overview_B.filter(pl.col('insurance')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('insurance').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('insurance')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 8376 



In [389]:
df_overview_B['insurance'].unique()

insurance
str
"""Not Available"""
"""Comprehensive"""
"""ThirdParty"""
"""Expired"""
"""Third Party"""


In [390]:
df_overview_B = df_overview_B.with_columns(pl.col('insurance').str.replace("ThirdParty","Third Party"))

In [391]:
df_overview_B.describe()

describe,profileId,vehicle,price,kilometers,fuelName,insurance,makeName,modelName,versionName,makeYear,makeMonth,cityName,noOfOwners,regType,priceNumeric,rootName,State,transmission
str,str,str,str,f64,str,str,str,str,str,f64,str,str,str,str,f64,str,str,str
"""count""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274"""
"""null_count""","""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""22068""",0.0,"""0""","""0""","""0"""
"""mean""",,,,58325.222567,,,,,,2016.437075,,,,,1436900.0,,,
"""std""",,,,55097.712796,,,,,,3.973605,,,,,2134200.0,,,
"""min""","""D1982769""","""Aston Martin R…","""1 Crore""",0.0,"""CNG""","""Comprehensive""","""Aston Martin""","""1 Series""","""1.0 Kappa Magn…",1900.0,"""Apr""","""Ahmedabad""","""4 or More""","""Corporate""",40000.0,"""1-Series""","""Chandigarh""","""Automatic"""
"""25%""",,,,30000.0,,,,,,2014.0,,,,,453775.0,,,
"""50%""",,,,52000.0,,,,,,2017.0,,,,,750000.0,,,
"""75%""",,,,76000.0,,,,,,2019.0,,,,,1525000.0,,,
"""max""","""S2805517""","""Volvo XC90 Mom…","""99.99 Lakh""",3950000.0,"""Petrol + Petro…","""Third Party""","""Volvo""","""redi-GO [2016-…","""xDrive40i Spor…",2023.0,"""Sep""","""Udupi""","""UnRegistered C…","""Taxi""",54500000.0,"""iX""","""Uttarakhand""","""Manual"""


##### `makeName`,`modelName`, `rootName` and `versionName`

In [392]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('makeName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('makeName').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('makeName')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('makeName')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('makeName')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('makeName')==None).shape[0],'\n',
        'NA',df_overview_B.filter(pl.col('makeName')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('makeName').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('makeName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [393]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('modelName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('modelName').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('modelName')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('modelName')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('modelName')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('modelName')==None).shape[0],'\n',
        'NA',df_overview_B.filter(pl.col('modelName')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('modelName').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('modelName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [394]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('rootName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('rootName').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('rootName')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('rootName')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('rootName')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('rootName')==None).shape[0],'\n',
        'NA',df_overview_B.filter(pl.col('rootName')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('rootName').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('rootName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [395]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('versionName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('versionName').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('versionName')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('versionName')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('versionName')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('versionName')==None).shape[0],'\n',
        'NA',df_overview_B.filter(pl.col('versionName')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('versionName').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('versionName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [396]:
print(list(df_overview_B['makeName'].unique()))

['Renault', 'Hummer', 'Nissan', 'Lexus', 'Chrysler', 'Lamborghini', 'Aston Martin', 'Skoda', 'Datsun', 'Fiat', 'Kia', 'BMW', 'Ford', 'Bentley', 'Jeep', 'Jaguar', 'Mahindra-Renault', 'Hindustan Motors', 'Chevrolet', 'Maruti Suzuki', 'Mitsubishi', 'Honda', 'Rolls-Royce', 'MG', 'Opel', 'MINI', 'Mercedes-Benz', 'Hyundai', 'Porsche', 'Ssangyong', 'Volkswagen', 'Force Motors', 'Toyota', 'Maserati', 'Audi', 'Mahindra', 'Isuzu', 'Tata', 'Citroen', 'Volvo', 'Land Rover']


In [397]:
print(list(df_overview_B['modelName'].unique()))

['GLB', '3 Series GT', 'Venue [2022-2023]', 'Bolero [2000-2007]', 'Bolero Neo', 'City ZX', 'X5 [2012-2014]', 'Dzire [2017-2020]', 'LS 600', 'Jazz [2009-2011]', 'Getz Prime [2007-2010]', 'Accord [2008-2011]', 'Continental Flying Spur', 'Gypsy', 'Land Cruiser Prado', 'RX [2017-2023]', 'Hector Plus [2020-2023]', 'Harrier [2019-2023]', 'GLE [2015-2020]', 'Elite i20 [2019-2020]', 'Etios Liva [2011-2013]', 'Xcent [2014-2017]', 'iX', 'Vento [2014-2015]', '2 Series Gran Coupe', 'Grand Cherokee', 'XK', 'Kwid [2019-2022]', '7 Series [2008-2013]', 'MU-X [2017-2018]', 'Cayenne', 'Gloster', 'Q3 [2012-2015]', 'City [2008-2011]', 'Civic [2010-2013]', '718', 'Laura', 'Accent [1999-2003]', 'Swift DZire [2011-2015]', 'S-Presso', 'Superb [2009-2014]', 'Optra Magnum [2007-2012]', 'B-Class [2012-2015]', 'Caravelle', 'Figo', 'XC60 [2010-2013]', 'Baleno [2019-2022]', 'Cooper [2014-2018]', 'Touareg', 'EcoSport', 'XUV300 TurboSport', 'X6 [2015-2019]', 'Innova Crysta [2020-2023]', 'Ritz [2009-2012]', 'Indica', 

In [398]:
print(list(df_overview_B['rootName'].unique()))

['Urban Cruiser Hyryder', 'Manza', 'Tigor EV', 'R8', 'XK', 'Jetta', 'Octavia', 'Esteem', 'Maybach S-Class', 'Beetle', 'Cherokee', 'Touareg', 'Corona', 'Safari', 'Spark', 'Kizashi', '190', 'Ameo', 'granroad', 'Hector Plus', 'Laura', 'E-Class', 'V90 Cross Country', 'Z4', 'Sonata', 'Aura', 'Palio', 'EQC', 'Accent', 'Marshal', 'Tavera', 'Rapid', 'Kona Electric', 'Superb', 'Q8', 'GL-Class', 'Kwid', 'Q7', 'Cooper S', 'Xcent', 'Verna', 'Amaze', 'Swift DZire', '2 Series Gran Coupe', 'XUV300', 'Sail Sedan', 'SLK-Class', 'Altroz', 'Jimny', 'Etios Liva', 'Cresta', 'Tiguan', 'XJ', '500', 'M2', 'Micra', 'XC60', 'Ghost', '1000', 'Venue', 'e2o', 'Quanto', 'Q5', 'S5 Sportback', 'Outlander', 'V40 Cross Country', 'Sonet', 'GTi', 'Hexa', 'Eeco', 'Optra', 'F-type', 'MU-X', '6-Series', 'Zen', 'Lodgy', 'Magnite', 'Qualis', 'Xylo', 'C-Class', 'Bolero', 'Omni', 'AMG GLA 35', 'Defender', 'Gypsy', 'Serena', 'Hector', 'G-Class', 'S90', 'S-Class', 'B-class', 'Punto EVO', 'Figo', 'Carens', '5-Series', 'A5', 'Hurac

In [399]:
print(list(df_overview_B['versionName'].unique()))

['LX', '1.8 J', 'EX 1.6 CRDi AT [2017-2018]', 'Magna 1.1 iRDE2 [2010-2017]', 'LXi Minor', '320d M Sport', 'XZA Plus Petrol Dark Edtion [2022-2023]', 'Monte Carlo 1.0 TSI AT', 'W8(O) Dual Tone', 'RXZ 1.5 Petrol MT', '1.2 SX Dual Tone', 'EX 1.4 CRDi', 'Magna 1.2 Kappa VTVT [2017-2020]', 'LXI CNG (O)', 'S', 'E250 CDI Classic', 'Highline Plus 1.5L (D)16 Alloy', 'Asta 1.2 with Sunroof', 'Zeta 1.3 Diesel', 'LXi [2021-2023]', '118d Hatchback', 'Active 1.3', 'W4 1.5 Diesel [2020]', '1.4  TSI Ambition', 'GLX 1.4', 'VXi Minor', 'Sportz Plus 1.2 Dual Tone [2019-2020]', 'XM', 'Super 1.5 Petrol [2019-2020]', 'LXi (O)', 'Connect 1.5 DLS', 'Asta 1.4 AT with AVN', 'Alpha Plus Intelligent Hybrid eCVT', '530i Sedan', 'EXi 1.4 Durasport', 'XL (P)', 'Luxury 1.4 Petrol 7 STR', 'Xeta eGL BS-IV', 'RxE Petrol', '630i M Sport [2021-2023]', 'XE CNG', '1.5 V MT', 'L&K TDI AT', '220 d Progressive', 'T4', '450d', 'E Diesel', 'Alpha Dual Tone eCVT', '1.5 TDI CR Ambition Plus AT', '2.4 GX Limited Edition AT 8 STR', 

* modelName carries a production-year suffix (e.g. 'Venue [2022-2023]'), while rootName holds the plain model name, so rootName will be used as modelName
* versionName holds the version (trim) of the model, so keeping that column intact

In [400]:
# dropping modelName and renaming rootName to modelName

df_overview_B = df_overview_B.drop(['modelName'])

In [401]:
df_overview_B = df_overview_B.rename({'rootName':'modelName'})

In [402]:
df_overview_B.columns

['profileId',
 'vehicle',
 'price',
 'kilometers',
 'fuelName',
 'insurance',
 'makeName',
 'versionName',
 'makeYear',
 'makeMonth',
 'cityName',
 'noOfOwners',
 'regType',
 'priceNumeric',
 'modelName',
 'State',
 'transmission']

##### `makeYear` and `makeMonth`

In [403]:
list(df_overview_B['makeMonth'].unique())

['Dec',
 'Aug',
 'Sep',
 'May',
 'Mar',
 'Apr',
 'Jan',
 'Oct',
 'Jun',
 'Nov',
 'Jul',
 'Feb']

In [404]:
list(df_overview_B['makeYear'].unique())

[1900,
 1988,
 1991,
 1993,
 1994,
 1995,
 1996,
 1998,
 1999,
 2000,
 2001,
 2002,
 2003,
 2004,
 2005,
 2006,
 2007,
 2008,
 2009,
 2010,
 2011,
 2012,
 2013,
 2014,
 2015,
 2016,
 2017,
 2018,
 2019,
 2020,
 2021,
 2022,
 2023]

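* makeMonth is consistent (12 month abbreviations) and needs no changes.
* makeYear runs from 1988 to 2023 apart from one distinct value of 1900, which may be a placeholder; no changes are made at this stage.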

In [405]:
df_overview_B.columns

['profileId',
 'vehicle',
 'price',
 'kilometers',
 'fuelName',
 'insurance',
 'makeName',
 'versionName',
 'makeYear',
 'makeMonth',
 'cityName',
 'noOfOwners',
 'regType',
 'priceNumeric',
 'modelName',
 'State',
 'transmission']

##### `noOfOwners`

In [406]:
list(df_overview_B['noOfOwners'].unique())

['4 or More', 'Second', 'Third', 'UnRegistered Car', 'First', 'Fourth']

* data is consistent, apart from the capitalisation of 'UnRegistered Car'
* changing 'UnRegistered Car' to 'Unregistered Car'

In [407]:
df_overview_B = df_overview_B.with_columns(pl.col('noOfOwners').str.replace('UnRegistered Car','Unregistered Car'))

In [408]:
df_overview_B.columns

['profileId',
 'vehicle',
 'price',
 'kilometers',
 'fuelName',
 'insurance',
 'makeName',
 'versionName',
 'makeYear',
 'makeMonth',
 'cityName',
 'noOfOwners',
 'regType',
 'priceNumeric',
 'modelName',
 'State',
 'transmission']

##### `regType`

In [409]:
list(df_overview_B['regType'].unique())

[None, 'Corporate', 'Individual', 'Taxi']

* regType values are consistent (Corporate, Individual, Taxi); the nulls (22068 in the describe output) are left as-is, so no changes are required

##### `priceNumeric` and `price`

In [410]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('price').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('priceNumeric').is_null()).shape[0],'\n',
        #'empty',df_overview_B.filter(pl.col('priceNumeric')=='').shape[0],'\n',
        #'space', df_overview_B.filter(pl.col('priceNumeric')==' ').shape[0],'\n',
        #'0 (Str)',df_overview_B.filter(pl.col('priceNumeric')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('priceNumeric')==None).shape[0],'\n',
        #'NA',df_overview_B.filter(pl.col('priceNumeric')=='Not Available').shape[0],'\n',
    
        #numeric
        'nan', df_overview_B.filter(pl.col('priceNumeric').is_nan()).shape[0],'\n',
        '0', df_overview_B.filter(pl.col('priceNumeric')==0).shape[0],'\n',
        

        )

null 0 
 None 0 
 nan 0 
 0 0 



In [411]:
# first clearing out the leading and trailing spaces

df_overview_B = df_overview_B.with_columns(pl.col('price').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_overview_B.filter(pl.col('price').is_null()).shape[0],'\n',
        'empty',df_overview_B.filter(pl.col('price')=='').shape[0],'\n',
        'space', df_overview_B.filter(pl.col('price')==' ').shape[0],'\n',
        '0 (Str)',df_overview_B.filter(pl.col('price')=='0').shape[0],'\n',
        'None',df_overview_B.filter(pl.col('price')==None).shape[0],'\n',
        'NA',df_overview_B.filter(pl.col('price')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_overview_B.filter(pl.col('price').is_nan()).shape[0],'\n',
        #'0', df_overview_B.filter(pl.col('price')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



* no inconsistencies found in the priceNumeric column
* price is a text label (e.g. '1 Crore', '99.99 Lakh'), so dropping price and using the priceNumeric column instead
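
Before dropping the text column, the labels could be cross-checked against priceNumeric by converting them to rupees (1 Lakh = 100,000 and 1 Crore = 10,000,000). The sketch below is plain Python and assumes labels of the form "<number> Lakh" or "<number> Crore", which is all that is visible in the describe output; other formats may exist in the data. `parse_price_label` is a hypothetical helper.

```python
import re

# rupee value of each unit word
UNIT_FACTORS = {"lakh": 100_000, "crore": 10_000_000}


def parse_price_label(label: str) -> int:
    """Convert a label such as '99.99 Lakh' to rupees (hypothetical helper)."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(lakh|crore)\s*", label, flags=re.IGNORECASE)
    if match is None:
        # any other label format is reported rather than guessed
        raise ValueError(f"unrecognised price label: {label!r}")
    amount, unit = match.groups()
    return round(float(amount) * UNIT_FACTORS[unit.lower()])


one_crore = parse_price_label("1 Crore")
just_under = parse_price_label("99.99 Lakh")
print(one_crore, just_under)
```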

In [412]:
df_overview_B['priceNumeric'].sort(descending= False)

priceNumeric
i64
40000
50000
50000
50000
50000
50000
55000
60000
65000
69000


In [413]:
df_overview_B = df_overview_B.drop('price')

In [414]:
df_overview_B.describe()

describe,profileId,vehicle,kilometers,fuelName,insurance,makeName,versionName,makeYear,makeMonth,cityName,noOfOwners,regType,priceNumeric,modelName,State,transmission
str,str,str,f64,str,str,str,str,f64,str,str,str,str,f64,str,str,str
"""count""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274"""
"""null_count""","""0""","""0""",0.0,"""0""","""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""22068""",0.0,"""0""","""0""","""0"""
"""mean""",,,58325.222567,,,,,2016.437075,,,,,1436900.0,,,
"""std""",,,55097.712796,,,,,3.973605,,,,,2134200.0,,,
"""min""","""D1982769""","""Aston Martin R…",0.0,"""CNG""","""Comprehensive""","""Aston Martin""","""1.0 Kappa Magn…",1900.0,"""Apr""","""Ahmedabad""","""4 or More""","""Corporate""",40000.0,"""1-Series""","""Chandigarh""","""Automatic"""
"""25%""",,,30000.0,,,,,2014.0,,,,,453775.0,,,
"""50%""",,,52000.0,,,,,2017.0,,,,,750000.0,,,
"""75%""",,,76000.0,,,,,2019.0,,,,,1525000.0,,,
"""max""","""S2805517""","""Volvo XC90 Mom…",3950000.0,"""Petrol + Petro…","""Third Party""","""Volvo""","""xDrive40i Spor…",2023.0,"""Sep""","""Udupi""","""Unregistered C…","""Taxi""",54500000.0,"""iX""","""Uttarakhand""","""Manual"""


In [415]:
# renaming priceNumeric to price
df_overview_B = df_overview_B.rename({'priceNumeric':'price'})

In [416]:
df_overview_B.describe()

describe,profileId,vehicle,kilometers,fuelName,insurance,makeName,versionName,makeYear,makeMonth,cityName,noOfOwners,regType,price,modelName,State,transmission
str,str,str,f64,str,str,str,str,f64,str,str,str,str,f64,str,str,str
"""count""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274""","""30274""",30274.0,"""30274""","""30274""","""30274"""
"""null_count""","""0""","""0""",0.0,"""0""","""0""","""0""","""0""",0.0,"""0""","""0""","""0""","""22068""",0.0,"""0""","""0""","""0"""
"""mean""",,,58325.222567,,,,,2016.437075,,,,,1436900.0,,,
"""std""",,,55097.712796,,,,,3.973605,,,,,2134200.0,,,
"""min""","""D1982769""","""Aston Martin R…",0.0,"""CNG""","""Comprehensive""","""Aston Martin""","""1.0 Kappa Magn…",1900.0,"""Apr""","""Ahmedabad""","""4 or More""","""Corporate""",40000.0,"""1-Series""","""Chandigarh""","""Automatic"""
"""25%""",,,30000.0,,,,,2014.0,,,,,453775.0,,,
"""50%""",,,52000.0,,,,,2017.0,,,,,750000.0,,,
"""75%""",,,76000.0,,,,,2019.0,,,,,1525000.0,,,
"""max""","""S2805517""","""Volvo XC90 Mom…",3950000.0,"""Petrol + Petro…","""Third Party""","""Volvo""","""xDrive40i Spor…",2023.0,"""Sep""","""Udupi""","""Unregistered C…","""Taxi""",54500000.0,"""iX""","""Uttarakhand""","""Manual"""


#### EXPORTING OVERVIEW DATA

In [417]:
df_overview_B.write_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\overview_B.csv')

### `Specifications` DATA

In [418]:
df_specifications_B = pl.read_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-I\specifications_B.csv')

In [419]:
df_specifications_B.shape

(894201, 8)

In [420]:
df_specifications_B.describe()

describe,specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city
str,str,str,str,str,f64,str,str,str
"""count""","""894201""","""894201""","""894201""","""894201""",894201.0,"""894201""","""894201""","""894201"""
"""null_count""","""0""","""0""","""525576""","""0""",0.0,"""0""","""0""","""0"""
"""mean""",,,,,3559.244744,,,
"""std""",,,,,2392.10689,,,
"""min""","""Acceleration (…",""" BIFUEL CNG 1…","""Doors""","""Capacity""",0.0,"""D1982769""","""Aston Martin R…","""Bangalore"""
"""25%""",,,,,1552.0,,,
"""50%""",,,,,3194.0,,,
"""75%""",,,,,5338.0,,,
"""max""","""Width""","""winPower Turbo…","""seconds""","""Suspensions, B…",9450.0,"""S2805517""","""Volvo XC90 Mom…","""Mumbai"""


#### REMOVING DUPLICATES

* profileId and specName together will be the unique key to identify each record.
* each car (profileId) should therefore have at most one row per specName.

In [421]:
# taking a copy of the original dataframe before removing duplicates
df_specifications_B_1 = df_specifications_B.clone()

In [422]:
df_specifications_B[['profileId','specName']].unique()

profileId,specName
str,str
"""D4135089""","""Fuel Type """
"""D4135089""","""Max Power (bhp…"
"""D4135089""","""Max Torque (Nm…"
"""D4135089""","""Transmission"""
"""D4135089""","""Turbocharger /…"
"""D4135089""","""Alternate Fuel…"
"""D4135089""","""Length"""
"""D4135089""","""Width"""
"""D4135089""","""Wheelbase"""
"""D4135089""","""Rear Brake Typ…"


In [423]:
df_specifications_B.group_by(['profileId','specName']).count().filter(pl.col('count')>1)

profileId,specName,count
str,str,u32
"""D4111595""","""Top Speed""",2
"""D4111595""","""Emission Stand…",2
"""D4111595""","""Four Wheel Ste…",2
"""D4111595""","""Front Suspensi…",2
"""D4111595""","""Steering Type""",2
"""D4116985""","""Mileage (ARAI)…",2
"""D4116985""","""Drivetrain""",2
"""D4116985""","""Minimum Turnin…",2
"""D4116985""","""Front Tyres""",2
"""D4116985""","""Rear Tyres""",2


In [424]:
df_specifications_B.filter((pl.col('profileId') == "D4116985") & (pl.col('specName') == "Engine"))

specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city
str,str,str,str,i64,str,str,str
"""Engine""","""3198 cc, 4 Cyl…",,"""Engine & Trans…",44,"""D4116985""","""Ford Endeavour…","""Bangalore"""
"""Engine""","""3198 cc, 4 Cyl…",,"""Engine & Trans…",1844,"""D4116985""","""Ford Endeavour…","""Hyderabad"""


In [425]:
df_specifications_B.unique(['profileId','specName']).filter((pl.col('profileId') == "D4116985") & (pl.col('specName') == "Engine"))

specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city
str,str,str,str,i64,str,str,str
"""Engine""","""3198 cc, 4 Cyl…",,"""Engine & Trans…",44,"""D4116985""","""Ford Endeavour…","""Bangalore"""


In [426]:
# keeping one row per (profileId, specName) pair

df_specifications_B = df_specifications_B.unique(['profileId','specName'])

In [427]:
# checking
df_specifications_B.filter((pl.col('profileId') == "D4116985") & (pl.col('specName') == "Engine"))

specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city
str,str,str,str,i64,str,str,str
"""Engine""","""3198 cc, 4 Cyl…",,"""Engine & Trans…",44,"""D4116985""","""Ford Endeavour…","""Bangalore"""


#### COLUMN-WISE UNDERSTANDING OF DATA AND CLEANING

In [428]:
df_specifications_B_1 = df_specifications_B.clone()

In [429]:
df_specifications_B.head(2)

specName,specValue,specUnit,spec_category,carId,profileId,vehicle,city
str,str,str,str,i64,str,str,str
"""Engine Type""","""2.0 L Kryotec""",,"""Engine & Trans…",0,"""D4135089""","""Tata Harrier X…","""Bangalore"""
"""Fuel Type ""","""Diesel""",,"""Engine & Trans…",0,"""D4135089""","""Tata Harrier X…","""Bangalore"""


* the city and carId columns are not needed, hence dropping them

In [430]:
df_specifications_B = df_specifications_B.drop(['carId','city'])

* analysing columns

##### `specName`

In [431]:
# first clearing out the leading and trailing spaces

df_specifications_B = df_specifications_B.with_columns(pl.col('specName').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specifications_B.filter(pl.col('specName').is_null()).shape[0],'\n',
        'empty',df_specifications_B.filter(pl.col('specName')=='').shape[0],'\n',
        'space', df_specifications_B.filter(pl.col('specName')==' ').shape[0],'\n',
        '0 (Str)',df_specifications_B.filter(pl.col('specName')=='0').shape[0],'\n',
        'None',df_specifications_B.filter(pl.col('specName')==None).shape[0],'\n',
        'NA',df_specifications_B.filter(pl.col('specName')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('specName').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('specName')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [432]:
list(df_specifications_B['specName'].unique())

['Highway Mileage (CarWale Tested)',
 'Turbocharger / Supercharger',
 'Mileage (ARAI)',
 'Drivetrain',
 'Four Wheel Steering',
 'Length',
 'Transmission',
 'Battery',
 'Ground Clearance',
 'Battery Charging',
 'Emission Standard',
 'Performance on Alternate Fuel',
 'Others',
 'Electric Motor',
 'Max Torque (Nm@rpm)',
 'Max Power (bhp@rpm)',
 'Spare Wheel',
 'Fuel Tank Capacity',
 'Width',
 'Electric Motor Assist',
 'City Mileage (CarWale Tested)',
 'Max Motor Performance',
 'Driving Range',
 'Height',
 'Bootspace',
 'Rear Suspension',
 'No of Seating Rows',
 'Wheels',
 'Kerb Weight',
 'Front Tyres',
 'Steering Type',
 'Minimum Turning Radius',
 'Front Suspension',
 'Top Speed',
 'Acceleration (0-100 kmph)',
 'Alternate Fuel',
 'Engine',
 'Rear Brake Type',
 'Engine Type',
 'Front Brake Type',
 'Fuel Type',
 'Seating Capacity',
 'Wheelbase',
 'Doors',
 'Rear Tyres',
 'Range (Carwale Tested)']

* no notable discrepancies

##### `specValue`

In [433]:
# first clearing out the leading and trailing spaces

df_specifications_B = df_specifications_B.with_columns(pl.col('specValue').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specifications_B.filter(pl.col('specValue').is_null()).shape[0],'\n',
        'empty',df_specifications_B.filter(pl.col('specValue')=='').shape[0],'\n',
        'space', df_specifications_B.filter(pl.col('specValue')==' ').shape[0],'\n',
        '0 (Str)',df_specifications_B.filter(pl.col('specValue')=='0').shape[0],'\n',
        'None',df_specifications_B.filter(pl.col('specValue')==None).shape[0],'\n',
        'NA',df_specifications_B.filter(pl.col('specValue')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('specValue').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('specValue')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 6487 
 None 0 
 NA 0 



* 6487 rows have the string '0' as specValue.
* whether '0' is a real value or a placeholder will be analysed by specification category.
* keeping the column as-is for now

##### `specUnit`

In [434]:
# first clearing out the leading and trailing spaces

df_specifications_B = df_specifications_B.with_columns(pl.col('specUnit').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specifications_B.filter(pl.col('specUnit').is_null()).shape[0],'\n',
        'empty',df_specifications_B.filter(pl.col('specUnit')=='').shape[0],'\n',
        'space', df_specifications_B.filter(pl.col('specUnit')==' ').shape[0],'\n',
        '0 (Str)',df_specifications_B.filter(pl.col('specUnit')=='0').shape[0],'\n',
        'None',df_specifications_B.filter(pl.col('specUnit')==None).shape[0],'\n',
        'NA',df_specifications_B.filter(pl.col('specUnit')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('specUnit').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('specUnit')==0).shape[0],'\n',
        

        )

null 516383 
 empty 0 
 space 0 
 0 (Str) 0 
 None 516383 
 NA 0 



In [435]:
list(df_specifications_B['specUnit'].unique())

['Doors',
 'Km',
 'kmpl',
 'Person',
 'km/full charge',
 'litres',
 'Rows',
 'Kmph',
 'km/kg',
 'mm',
 'seconds',
 'kg',
 None,
 'metres']

##### `profileId`

In [436]:
# first clearing out the leading and trailing spaces

df_specifications_B = df_specifications_B.with_columns(pl.col('profileId').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specifications_B.filter(pl.col('profileId').is_null()).shape[0],'\n',
        'empty',df_specifications_B.filter(pl.col('profileId')=='').shape[0],'\n',
        'space', df_specifications_B.filter(pl.col('profileId')==' ').shape[0],'\n',
        '0 (Str)',df_specifications_B.filter(pl.col('profileId')=='0').shape[0],'\n',
        'None',df_specifications_B.filter(pl.col('profileId')==None).shape[0],'\n',
        'NA',df_specifications_B.filter(pl.col('profileId')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('profileId').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('profileId')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



* no discrepancies in profileId

##### `vehicle`

In [437]:
# first clearing out the leading and trailing spaces

df_specifications_B = df_specifications_B.with_columns(pl.col('vehicle').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specifications_B.filter(pl.col('vehicle').is_null()).shape[0],'\n',
        'empty',df_specifications_B.filter(pl.col('vehicle')=='').shape[0],'\n',
        'space', df_specifications_B.filter(pl.col('vehicle')==' ').shape[0],'\n',
        '0 (Str)',df_specifications_B.filter(pl.col('vehicle')=='0').shape[0],'\n',
        'None',df_specifications_B.filter(pl.col('vehicle')==None).shape[0],'\n',
        'NA',df_specifications_B.filter(pl.col('vehicle')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('vehicle').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('vehicle')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



* no discrepancies in vehicle data

##### `spec_category`

In [438]:
# first clearing out the leading and trailing spaces

df_specifications_B = df_specifications_B.with_columns(pl.col('spec_category').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specifications_B.filter(pl.col('spec_category').is_null()).shape[0],'\n',
        'empty',df_specifications_B.filter(pl.col('spec_category')=='').shape[0],'\n',
        'space', df_specifications_B.filter(pl.col('spec_category')==' ').shape[0],'\n',
        '0 (Str)',df_specifications_B.filter(pl.col('spec_category')=='0').shape[0],'\n',
        'None',df_specifications_B.filter(pl.col('spec_category')==None).shape[0],'\n',
        'NA',df_specifications_B.filter(pl.col('spec_category')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('spec_category').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('spec_category')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [439]:
list(df_specifications_B['spec_category'].unique())

['Suspensions, Brakes, Steering & Tyres',
 'Dimensions & Weight',
 'Engine & Transmission',
 'Capacity']

#### LOOKING INTO SPECIFICATION CATEGORIES

* The dataset will be split by CATEGORIES for ease of analysis
* There are 4 categories and data will be split according to this:
    * 'Engine & Transmission'
    * 'Dimensions & Weight'
    * 'Capacity'
    * 'Suspensions, Brakes, Steering & Tyres'

In [440]:
df_specifications_B.filter((pl.col('spec_category')=='Engine & Transmission') & (pl.col('profileId')=='D4116985'))

specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Engine""","""3198 cc, 4 Cyl…",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Max Torque (Nm…","""470 Nm @ 1750 …",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Engine Type""","""3.2 l TDCi""",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Mileage (ARAI)…","""10.91""","""kmpl""","""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Transmission""","""Automatic - 6 …",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Turbocharger /…","""Turbocharged""",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Fuel Type""","""Diesel""",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Max Power (bhp…","""197 bhp @ 3000…",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"
"""Drivetrain""","""4WD / AWD""",,"""Engine & Trans…","""D4116985""","""Ford Endeavour…"


In [441]:
df_specifications_B.filter((pl.col('spec_category')=='Dimensions & Weight') & (pl.col('profileId')=='D4116985'))

specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Wheelbase""","""2850""","""mm""","""Dimensions & W…","""D4116985""","""Ford Endeavour…"
"""Ground Clearan…","""225""","""mm""","""Dimensions & W…","""D4116985""","""Ford Endeavour…"
"""Length""","""4892""","""mm""","""Dimensions & W…","""D4116985""","""Ford Endeavour…"
"""Width""","""1860""","""mm""","""Dimensions & W…","""D4116985""","""Ford Endeavour…"
"""Height""","""1837""","""mm""","""Dimensions & W…","""D4116985""","""Ford Endeavour…"


In [442]:
df_specifications_B.filter((pl.col('spec_category')=='Capacity') & (pl.col('profileId')=='D4116985'))

specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Seating Capaci…","""7""","""Person""","""Capacity""","""D4116985""","""Ford Endeavour…"
"""Fuel Tank Capa…","""80""","""litres""","""Capacity""","""D4116985""","""Ford Endeavour…"
"""Doors""","""5""","""Doors""","""Capacity""","""D4116985""","""Ford Endeavour…"
"""No of Seating …","""3""","""Rows""","""Capacity""","""D4116985""","""Ford Endeavour…"


In [443]:
df_specifications_B.filter((pl.col('spec_category')=='Suspensions, Brakes, Steering & Tyres') & (pl.col('profileId')=='D4116985'))


specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Rear Brake Typ…","""Disc""",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Rear Suspensio…","""Coil spring, W…",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Front Brake Ty…","""Disc""",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Minimum Turnin…","""4.9""","""metres""","""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Wheels""","""Alloy Wheels""",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Spare Wheel""","""Steel""",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Front Suspensi…","""Independent Co…",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Rear Tyres""","""265 / 60 R18""",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Steering Type""","""Power assisted…",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"
"""Front Tyres""","""265 / 60 R18""",,"""Suspensions, B…","""D4116985""","""Ford Endeavour…"


- The categories that might be useful for our analysis are:
    * 'Engine & Transmission'
    * 'Dimensions & Weight'
    * 'Capacity'
- Ignoring the 'Suspensions, Brakes, Steering & Tyres' category

#### `Engine & Transmission`

In [444]:
# filtering the 'Engine & Transmission' category
df_specs_B_Engine = df_specifications_B.filter(pl.col('spec_category')=='Engine & Transmission')

In [445]:
df_specs_B_Engine.head()

specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Engine Type""","""2.0 L Kryotec""",,"""Engine & Trans…","""D4135089""","""Tata Harrier X…"
"""Fuel Type""","""Diesel""",,"""Engine & Trans…","""D4135089""","""Tata Harrier X…"
"""Max Power (bhp…","""168 bhp @ 3750…",,"""Engine & Trans…","""D4135089""","""Tata Harrier X…"
"""Mileage (ARAI)…","""16.3""","""kmpl""","""Engine & Trans…","""D4135089""","""Tata Harrier X…"
"""Drivetrain""","""FWD""",,"""Engine & Trans…","""D4135089""","""Tata Harrier X…"


In [446]:
df_specs_B_Engine = df_specs_B_Engine.pivot(index = ['profileId','vehicle'],columns = 'specName', values = 'specValue')
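
The pivot turns the long layout (one row per car and specification) into a wide one (one row per car, one column per specName). The plain-Python sketch below mimics that reshaping on a few toy rows to show the idea; it is not how polars implements it. Note that in this toy version a repeated (car, specName) pair would silently overwrite the earlier value, which is one reason the duplicate removal on (profileId, specName) comes first.

```python
# toy long-format rows (illustrative values only)
long_rows = [
    {"profileId": "D1", "vehicle": "Car A", "specName": "Fuel Type", "specValue": "Diesel"},
    {"profileId": "D1", "vehicle": "Car A", "specName": "Drivetrain", "specValue": "FWD"},
    {"profileId": "D2", "vehicle": "Car B", "specName": "Fuel Type", "specValue": "Petrol"},
]

# long -> wide: one dict of {specName: specValue} per (profileId, vehicle)
wide = {}
for row in long_rows:
    key = (row["profileId"], row["vehicle"])
    wide.setdefault(key, {})[row["specName"]] = row["specValue"]

# specifications a car does not have simply stay missing (null in the polars output)
spec_names = ["Fuel Type", "Drivetrain"]
table = [
    [pid, vehicle] + [specs.get(name) for name in spec_names]
    for (pid, vehicle), specs in wide.items()
]
print(table)
```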

In [447]:
df_specs_B_Engine.head()

profileId,vehicle,Engine Type,Fuel Type,Max Power (bhp@rpm),Mileage (ARAI),Drivetrain,Turbocharger / Supercharger,Transmission,Others,Emission Standard,Engine,Max Torque (Nm@rpm),Driving Range,Alternate Fuel,Top Speed,Acceleration (0-100 kmph),Electric Motor,Battery,Max Motor Performance,Battery Charging,Performance on Alternate Fuel,Range (Carwale Tested),Electric Motor Assist,City Mileage (CarWale Tested),Highway Mileage (CarWale Tested)
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""D4135089""","""Tata Harrier X…","""2.0 L Kryotec""","""Diesel""","""168 bhp @ 3750…","""16.3""","""FWD""","""Turbocharged""","""Manual - 6 Gea…",,"""BS 6""","""1956 cc, 4 Cyl…","""350 Nm @ 1750 …","""818""","""Not Applicable…",,,,,,,,,,,
"""D4170587""","""BMW 3 Series G…",,"""Diesel""","""184 bhp @ 4000…","""19.59""","""RWD""","""Turbocharged""","""Automatic - 8 …","""Idle Start/Sto…",,"""1995 cc, 4 Cyl…","""380 Nm @ 1750 …",,,,,,,,,,,,,
"""D4170559""","""Jeep Compass L…",,"""Diesel""","""171 bhp @ 3750…","""17.1""","""FWD""","""Turbocharged""","""Manual - 6 Gea…",,"""BS 4""","""1956 cc, 4 Cyl…","""350 Nm @ 1750 …",,,,,,,,,,,,,
"""D4112409""","""Mercedes-Benz …","""OM 654""","""Diesel""","""197 bhp @ 3600…",,"""RWD""","""Turbocharged""","""Automatic (TC)…","""Idle Start/Sto…","""BS 6""","""1993 cc, 4 Cyl…","""440 Nm @ 1800-…",,"""Not Applicable…","""245""","""7.3""",,,,,,,,,
"""D4087571""","""Jeep Compass L…",,"""Petrol""","""160 bhp @ 5500…","""14.1""","""FWD""","""Turbocharged""","""Automatic - 7 …",,"""BS 4""","""1368 cc, 4 Cyl…","""250 Nm @ 2500 …",,,,,,,,,,,,,


In [448]:
df_specs_B_Engine.describe()

describe,profileId,vehicle,Engine Type,Fuel Type,Max Power (bhp@rpm),Mileage (ARAI),Drivetrain,Turbocharger / Supercharger,Transmission,Others,Emission Standard,Engine,Max Torque (Nm@rpm),Driving Range,Alternate Fuel,Top Speed,Acceleration (0-100 kmph),Electric Motor,Battery,Max Motor Performance,Battery Charging,Performance on Alternate Fuel,Range (Carwale Tested),Electric Motor Assist,City Mileage (CarWale Tested),Highway Mileage (CarWale Tested)
str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str
"""count""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323"""
"""null_count""","""0""","""0""","""3283""","""14""","""370""","""2763""","""1564""","""6380""","""1""","""24395""","""18605""","""204""","""370""","""24974""","""22947""","""29529""","""28671""","""30065""","""29908""","""30102""","""30156""","""30112""","""30307""","""30307""","""30280""","""30280"""
"""mean""",,,,,,,,,,,,,,,,,,,,,,,,,,
"""std""",,,,,,,,,,,,,,,,,,,,,,,,,,
"""min""","""D1982769""","""Aston Martin R…","""0.8 L""","""CNG""","""100 bhp @ 3600…","""10""","""4WD""","""No""","""AMT - 5 Gears""","""Idle Start/Sto…","""BS 4""","""1047 cc, 3 Cyl…","""100 Nm @ 2700 …","""1000.28""","""CNG""","""120""","""10.24""","""1 3 Phase AC I…","""1.3kWh, Lithiu…","""118 bhp @ 88 r…","""11.3 Hrs @ 220…","""25 bhp @ 3750 …","""223.9""","""143 bhp @ 6200…","""10.62""","""15.34"""
"""25%""",,,,,,,,,,,,,,,,,,,,,,,,,,
"""50%""",,,,,,,,,,,,,,,,,,,,,,,,,,
"""75%""",,,,,,,,,,,,,,,,,,,,,,,,,,
"""max""","""S2805517""","""Volvo XC90 Mom…","""winPower Turbo…","""Petrol""","""@""","""9.91""","""RWD""","""Yes""","""Manual - 6 Gea…","""Regenerative B…","""Not Applicable…","""Not Applicable…","""@""","""997""","""Petrol""","""305""","""9.95""","""Permanent magn…","""Nickel Metal H…","""79 bhp @ 3995 …","""90 Hrs @ 220 V…","""86 bhp @ 5600 …","""340.5""","""215 bhp @ 5700…","""9.5""","""22.3"""


In [449]:
df_specs_B_Engine = df_specs_B_Engine.drop(columns = ['Acceleration (0-100 kmph)','Battery','Battery Charging','Drivetrain','Driving Range',
                                    'Electric Motor','Electric Motor Assist','Emission Standard','Max Motor Performance','Others',
                                     'Performance on Alternate Fuel','Range (Carwale Tested)','Top Speed','Turbocharger / Supercharger'])

In [450]:
df_specs_B_Engine.describe()

describe,profileId,vehicle,Engine Type,Fuel Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),Alternate Fuel,City Mileage (CarWale Tested),Highway Mileage (CarWale Tested)
str,str,str,str,str,str,str,str,str,str,str,str,str
"""count""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323""","""30323"""
"""null_count""","""0""","""0""","""3283""","""14""","""370""","""2763""","""1""","""204""","""370""","""22947""","""30280""","""30280"""
"""mean""",,,,,,,,,,,,
"""std""",,,,,,,,,,,,
"""min""","""D1982769""","""Aston Martin R…","""0.8 L""","""CNG""","""100 bhp @ 3600…","""10""","""AMT - 5 Gears""","""1047 cc, 3 Cyl…","""100 Nm @ 2700 …","""CNG""","""10.62""","""15.34"""
"""25%""",,,,,,,,,,,,
"""50%""",,,,,,,,,,,,
"""75%""",,,,,,,,,,,,
"""max""","""S2805517""","""Volvo XC90 Mom…","""winPower Turbo…","""Petrol""","""@""","""9.91""","""Manual - 6 Gea…","""Not Applicable…","""@""","""Petrol""","""9.5""","""22.3"""


##### `fuel`

In [451]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Fuel Type').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Fuel Type').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Fuel Type')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Fuel Type')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Fuel Type')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Fuel Type')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Fuel Type')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('Fuel Type').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('Fuel Type')==0).shape[0],'\n',
        

        )

null 14 
 empty 0 
 space 0 
 0 (Str) 0 
 None 14 
 NA 0 



* 14 null values were found in `Fuel Type` (no empty strings or placeholder values)
* there are LPG and CNG vehicles listed within `Alternate Fuel`
* checking if the null `Fuel Type` values can be filled from `Alternate Fuel`

In [452]:
df_specs_B_Engine.filter((pl.col('Fuel Type').is_null()) & (pl.col("Alternate Fuel").is_not_null()))

profileId,vehicle,Engine Type,Fuel Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),Alternate Fuel,City Mileage (CarWale Tested),Highway Mileage (CarWale Tested)
str,str,str,str,str,str,str,str,str,str,str,str
"""D4138427""","""Maruti Suzuki …","""FC engine""",,"""39 bhp @ 6200 …","""26.83""","""Manual - 5 Gea…","""796 cc, 3 Cyli…","""54 Nm @ 3000 r…","""CNG""",,
"""D4120915""","""Maruti Suzuki …","""16V DOHC VVT""",,"""87 bhp @ 5600 …","""21.4""","""Manual - 5 Gea…","""1586 cc, 4 Cyl…","""122 Nm @ 4100 …","""CNG""",,
"""D4122817""","""Maruti Suzuki …","""FC engine""",,"""39 bhp @ 6200 …","""26.83""","""Manual - 5 Gea…","""796 cc, 3 Cyli…","""54 Nm @ 3000 r…","""CNG""",,
"""S2766359""","""Maruti Suzuki …","""16V DOHC VVT""",,"""87 bhp @ 5600 …","""21.4""","""Manual - 5 Gea…","""1586 cc, 4 Cyl…","""122 Nm @ 4100 …","""CNG""",,
"""D4109083""","""Maruti Suzuki …","""FC engine""",,"""39 bhp @ 6200 …","""26.83""","""Manual - 5 Gea…","""796 cc, 3 Cyli…","""54 Nm @ 3000 r…","""CNG""",,
"""D4085263""","""Maruti Suzuki …","""K10B""",,"""46 bhp @ 6200 …","""13.1""","""Manual - 5 Gea…","""998 cc, 3 Cyli…","""85 Nm @ 3500 r…","""LPG""",,
"""D4110431""","""Maruti Suzuki …","""16V DOHC VVT""",,"""87 bhp @ 5600 …","""21.4""","""Manual - 5 Gea…","""1586 cc, 4 Cyl…","""122 Nm @ 4100 …","""CNG""",,
"""D4014089""","""Maruti Suzuki …","""16V DOHC VVT""",,"""87 bhp @ 5600 …","""21.4""","""Manual - 5 Gea…","""1586 cc, 4 Cyl…","""122 Nm @ 4100 …","""CNG""",,
"""D4153601""","""Maruti Suzuki …","""16V DOHC VVT""",,"""87 bhp @ 5600 …","""21.4""","""Manual - 5 Gea…","""1586 cc, 4 Cyl…","""122 Nm @ 4100 …","""CNG""",,
"""D3597535""","""Maruti Suzuki …","""16V DOHC VVT""",,"""87 bhp @ 5600 …","""21.4""","""Manual - 5 Gea…","""1586 cc, 4 Cyl…","""122 Nm @ 4100 …","""CNG""",,


In [453]:
# filling null Fuel Type values from Alternate Fuel into a new Fuel column
df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.when(pl.col('Fuel Type').is_null()).then(pl.col('Alternate Fuel')).otherwise(pl.col('Fuel Type')).alias('Fuel'))

In [454]:
list(df_specs_B_Engine['Fuel'].unique())

['CNG',
 'Diesel',
 None,
 'Petrol',
 'Hybrid (Electric + Petrol)',
 'Electric',
 'LPG',
 'Mild Hybrid(Electric + Petrol)']

* the usual fuel types are present: Petrol, Diesel, CNG and LPG, along with Electric
* there are several sub-categories of hybrid, but for analysis purposes we will group them into either Hybrid (Electric + Petrol) or Hybrid (Electric + Diesel)

In [455]:
# merging 'Mild Hybrid(Electric + Petrol)' into 'Hybrid (Electric + Petrol)'
df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Fuel').str.replace('Mild Hybrid(Electric + Petrol)','Hybrid (Electric + Petrol)',literal = True))

In [456]:
list(df_specs_B_Engine['Fuel'].unique())

[None,
 'LPG',
 'CNG',
 'Diesel',
 'Hybrid (Electric + Petrol)',
 'Electric',
 'Petrol']

In [457]:
df_specs_B_Engine = df_specs_B_Engine.drop(['Fuel Type','Alternate Fuel'])

##### `Mileage (ARAI)` ,`City Mileage (CarWale Tested)` and `Highway Mileage (CarWale Tested)`

In [458]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Mileage (ARAI)').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Mileage (ARAI)').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Mileage (ARAI)')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Mileage (ARAI)')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Mileage (ARAI)')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Mileage (ARAI)')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Mileage (ARAI)')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('Mileage (ARAI)').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('Mileage (ARAI)')==0).shape[0],'\n',
        

        )

null 2763 
 empty 0 
 space 0 
 0 (Str) 0 
 None 2763 
 NA 0 



* There are a lot of null mileage values (2,763 of 30,323 rows)

In [459]:
df_specs_B_Engine.head(2)

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),City Mileage (CarWale Tested),Highway Mileage (CarWale Tested),Fuel
str,str,str,str,str,str,str,str,str,str,str
"""D4135089""","""Tata Harrier X…","""2.0 L Kryotec""","""168 bhp @ 3750…","""16.3""","""Manual - 6 Gea…","""1956 cc, 4 Cyl…","""350 Nm @ 1750 …",,,"""Diesel"""
"""D4170587""","""BMW 3 Series G…",,"""184 bhp @ 4000…","""19.59""","""Automatic - 8 …","""1995 cc, 4 Cyl…","""380 Nm @ 1750 …",,,"""Diesel"""


* checking if value exists for Highway Mileage (CarWale Tested) with empty City Mileage (CarWale Tested)


In [460]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('City Mileage (CarWale Tested)').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('City Mileage (CarWale Tested)').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('City Mileage (CarWale Tested)')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('City Mileage (CarWale Tested)')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('City Mileage (CarWale Tested)')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('City Mileage (CarWale Tested)')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('City Mileage (CarWale Tested)')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('City Mileage (CarWale Tested)').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('City Mileage (CarWale Tested)')==0).shape[0],'\n',
        

        )

null 30280 
 empty 0 
 space 0 
 0 (Str) 0 
 None 30280 
 NA 0 



In [461]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Highway Mileage (CarWale Tested)').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Highway Mileage (CarWale Tested)').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Highway Mileage (CarWale Tested)')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Highway Mileage (CarWale Tested)')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Highway Mileage (CarWale Tested)')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Highway Mileage (CarWale Tested)')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Highway Mileage (CarWale Tested)')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specifications_B.filter(pl.col('Highway Mileage (CarWale Tested)').is_nan()).shape[0],'\n',
        #'0', df_specifications_B.filter(pl.col('Highway Mileage (CarWale Tested)')==0).shape[0],'\n',
        

        )

null 30280 
 empty 0 
 space 0 
 0 (Str) 0 
 None 30280 
 NA 0 



In [462]:
df_specs_B_Engine.filter((pl.col('City Mileage (CarWale Tested)').is_not_null()) & (pl.col('Highway Mileage (CarWale Tested)').is_null()))

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),City Mileage (CarWale Tested),Highway Mileage (CarWale Tested),Fuel
str,str,str,str,str,str,str,str,str,str,str


In [463]:
# checking if a value exists for Highway Mileage (CarWale Tested) where City Mileage (CarWale Tested) is null
df_specs_B_Engine.filter((pl.col('City Mileage (CarWale Tested)').is_null()) & (pl.col('Highway Mileage (CarWale Tested)').is_not_null()))

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),City Mileage (CarWale Tested),Highway Mileage (CarWale Tested),Fuel
str,str,str,str,str,str,str,str,str,str,str


* checking if null Mileage (ARAI) values could be replaced with City Mileage or Highway Mileage

In [464]:
df_specs_B_Engine.filter((pl.col('Mileage (ARAI)').is_null()) & (pl.col('City Mileage (CarWale Tested)').is_not_null()))

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),City Mileage (CarWale Tested),Highway Mileage (CarWale Tested),Fuel
str,str,str,str,str,str,str,str,str,str,str


In [465]:
df_specs_B_Engine.filter((pl.col('Mileage (ARAI)').is_null()) & (pl.col('Highway Mileage (CarWale Tested)').is_not_null()))

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),City Mileage (CarWale Tested),Highway Mileage (CarWale Tested),Fuel
str,str,str,str,str,str,str,str,str,str,str


In [466]:
# city and highway mileage are populated for only 43 of 30,323 rows and cannot fill any missing ARAI mileage
# dropping both city and highway mileage columns
df_specs_B_Engine = df_specs_B_Engine.drop(['City Mileage (CarWale Tested)','Highway Mileage (CarWale Tested)'])

In [467]:
df_specs_B_Engine.head(2)

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),Fuel
str,str,str,str,str,str,str,str,str
"""D4135089""","""Tata Harrier X…","""2.0 L Kryotec""","""168 bhp @ 3750…","""16.3""","""Manual - 6 Gea…","""1956 cc, 4 Cyl…","""350 Nm @ 1750 …","""Diesel"""
"""D4170587""","""BMW 3 Series G…",,"""184 bhp @ 4000…","""19.59""","""Automatic - 8 …","""1995 cc, 4 Cyl…","""380 Nm @ 1750 …","""Diesel"""


In [468]:
print(list(df_specs_B_Engine['Mileage (ARAI)'].unique()))

['12.62', '11', '20.58', '12.9', '19.12', '12.36', '25.31999969482422', '16.93', '15.56', '6.9', '9.1', '13.7', None, '17.24', '9.76', '24.75', '20.6', '18.18', '18.2', '9.8', '19.83', '13.24', '14.3', '7.96', '20.73', '10.4', '22.74', '13.9', '10.6', '6.1', '11.14', '10.91', '14.22', '22.54', '12.81', '12.08', '15.06', '26.83', '22.7', '21.38', '16.38', '17.69', '20.25', '26.4', '7.9', '22.8', '17.8', '24.2', '25.11', '22.9', '16.78', '18.78', '15.63', '17.95', '11.12', '25.2', '17.33', '25.6', '24.200000762939453', '26.49', '17.2', '16.47', '15.17', '15.85', '9.46', '26.11', '22.32', '23.9', '12.03', '18.48', '10.93', '21.72', '11.5', '15.97', '12.2', '10.9', '13.6', '15.1', '15.6', '13.17', '25.83', '10.34', '12.88', '24.5', '25.44', '21.15', '17.52', '16.29', '10.7', '20.37', '24.29', '20.28', '14.71', '22.04', '20.3799991607666', '14.23', '24.7', '17.71', '18.143', '17.97', '15.3', '14.29', '12.18', '31.19', '26.59', '19.4', '15.96', '16.5', '14.42', '25.5', '18.15', '12.37', '12.

In [469]:
# converting mileage column to float
df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Mileage (ARAI)').cast(pl.Float64))

In [470]:
df_specs_B_Engine.describe()

describe,profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Transmission,Engine,Max Torque (Nm@rpm),Fuel
str,str,str,str,str,f64,str,str,str,str
"""count""","""30323""","""30323""","""30323""","""30323""",30323.0,"""30323""","""30323""","""30323""","""30323"""
"""null_count""","""0""","""0""","""3283""","""370""",2763.0,"""1""","""204""","""370""","""1"""
"""mean""",,,,,18.279344,,,,
"""std""",,,,,4.551692,,,,
"""min""","""D1982769""","""Aston Martin R…","""0.8 L""","""100 bhp @ 3600…",4.7,"""AMT - 5 Gears""","""1047 cc, 3 Cyl…","""100 Nm @ 2700 …","""CNG"""
"""25%""",,,,,15.26,,,,
"""50%""",,,,,18.15,,,,
"""75%""",,,,,21.0,,,,
"""max""","""S2805517""","""Volvo XC90 Mom…","""winPower Turbo…","""@""",140.0,"""Manual - 6 Gea…","""Not Applicable…","""@""","""Petrol"""


##### `Transmission`

In [471]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Transmission').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Transmission').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Transmission')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Transmission')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Transmission')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Transmission')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Transmission')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Engine.filter(pl.col('Transmission').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Engine.filter(pl.col('Transmission')==0).shape[0],'\n',
        

        )

null 1 
 empty 0 
 space 0 
 0 (Str) 0 
 None 1 
 NA 0 



In [472]:
print(list(df_specs_B_Engine['Transmission'].unique()))

['Manual - 6 Gears, Manual Override', 'Automatic (TC) - 7 Gears, Manual Override & Paddle Shift, Sport Mode', 'Automatic - 4 Gears', 'Automatic (DCT) - 6 Gears, Manual Override', 'Automatic (CVT) - 6 Gears', 'Automatic (TC) - 9 Gears, Paddle Shift, Sport Mode', 'Automatic - 10 Gears, Manual Override, Sport Mode', 'Automatic (CVT) - CVT Gears, Paddle Shift, Sport Mode', None, 'Manual - 5 Gears', 'Automatic (TC)', 'Automatic (TC) - 6 Gears, Paddle Shift', 'Automatic (CVT) - 7 Gears, Paddle Shift, Sport Mode', 'Automatic (CVT) - 6 Gears, Manual Override', 'Automatic (TC) - 6 Gears, Manual Override', 'Automatic - 8 Gears, Sport Mode', 'Automatic (DCT) - 8 Gears, Manual Override & Paddle Shift, Sport Mode', 'Automatic (AMT) - 5 Gears, Manual Override', 'AMT - 5 Gears, Manual Override', 'Automatic (DCT) - 7 Gears, Manual Override, Sport Mode', 'Automatic (CVT) - 7 Gears, Manual Override', 'Automatic (TC) - 7 Gears, Manual Override', 'Automatic - 7 Gears, Sport Mode', 'Automatic - 8 Gears, Ma

* the transmission column has only 1 null value
* grouping them in 2 categories
    * Automatic
    * Manual

In [473]:
df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Transmission').str.split(' ').list[0].str.replace("Clutchless",'Automatic').str.replace('AMT','Automatic').alias('transmission'))

In [474]:
# deleting the original Transmission column
df_specs_B_Engine = df_specs_B_Engine.drop(['Transmission'])

In [475]:
df_specs_B_Engine.describe()

describe,profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Engine,Max Torque (Nm@rpm),Fuel,transmission
str,str,str,str,str,f64,str,str,str,str
"""count""","""30323""","""30323""","""30323""","""30323""",30323.0,"""30323""","""30323""","""30323""","""30323"""
"""null_count""","""0""","""0""","""3283""","""370""",2763.0,"""204""","""370""","""1""","""1"""
"""mean""",,,,,18.279344,,,,
"""std""",,,,,4.551692,,,,
"""min""","""D1982769""","""Aston Martin R…","""0.8 L""","""100 bhp @ 3600…",4.7,"""1047 cc, 3 Cyl…","""100 Nm @ 2700 …","""CNG""","""Automatic"""
"""25%""",,,,,15.26,,,,
"""50%""",,,,,18.15,,,,
"""75%""",,,,,21.0,,,,
"""max""","""S2805517""","""Volvo XC90 Mom…","""winPower Turbo…","""@""",140.0,"""Not Applicable…","""@""","""Petrol""","""Manual"""


##### `Engine AND Engine Type`

In [476]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Engine').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Engine').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Engine')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Engine')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Engine')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Engine')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Engine')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Engine.filter(pl.col('Engine').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Engine.filter(pl.col('Engine')==0).shape[0],'\n',
        

        )

null 204 
 empty 0 
 space 0 
 0 (Str) 0 
 None 204 
 NA 0 



* There are 204 null values
* Checking if any other field could compensate for the null values

In [477]:
df_specs_B_Engine.head(2)

profileId,vehicle,Engine Type,Max Power (bhp@rpm),Mileage (ARAI),Engine,Max Torque (Nm@rpm),Fuel,transmission
str,str,str,str,f64,str,str,str,str
"""D4135089""","""Tata Harrier X…","""2.0 L Kryotec""","""168 bhp @ 3750…",16.3,"""1956 cc, 4 Cyl…","""350 Nm @ 1750 …","""Diesel""","""Manual"""
"""D4170587""","""BMW 3 Series G…",,"""184 bhp @ 4000…",19.59,"""1995 cc, 4 Cyl…","""380 Nm @ 1750 …","""Diesel""","""Automatic"""


* Checking if Engine Type field is of any use

In [478]:
list(df_specs_B_Engine['Engine Type'].unique())

['Revotron, MPFi with MULTI DRIVE',
 'turbocharged diesel engine, turbocharger with self-aligning blades, in-line, liquid cooling system, high-pressure direct injection system, DOHC, transverse in front',
 'F M 2.6 CR CD BS VI',
 '1.4 U2 CRDi Diesel 16 Valves, 4 Cylinder',
 '1.2 L VVT Engine',
 '2GD-FTV',
 'Four Cylinder Inline Turbocharged',
 'V6 cylinder diesel engine with turbocharger and second generation common rall technology',
 'TwinPower Turbo 4-Cylinder engine',
 'Turbocharged petrol engine with direct injection system',
 '1.2L Revotron Petrol',
 '1.3L, DOHC, DDiS Diesel',
 '1.5L, SOHC, 4-cylinder, i-VTEC',
 'Inline four-cylinder TDI diesel engine',
 '2.2L VariCOR',
 'Comon Rail CR4, 16-valve, DOHC',
 '2.0 Liter, 4-cylinder, 16 valve, DOHC, VVT-i',
 '3.5L Petrol',
 'B 180 BlueEFFICIENCY 4 cylinder inline',
 '1.2L Turbocharged Revotron Engine',
 'Electric',
 'Four-cylinder twin turbo-charged diesel engine',
 'R 2.2 CRDi H-matic',
 '2GD-FTV Diesel',
 'Water-Cooled, 4-Stroke, i-V

* no cc data was found in Engine Type, so it is not useful
* deleting this column

In [479]:
df_specs_B_Engine = df_specs_B_Engine.drop(['Engine Type'])

* the Engine column contains 4 values: cc, cylinders, valves and camshaft
* the cylinders, valves and camshaft values are not needed for analysis
* only the cc value will be extracted

In [480]:
df_specs_B_Engine.head(2)

profileId,vehicle,Max Power (bhp@rpm),Mileage (ARAI),Engine,Max Torque (Nm@rpm),Fuel,transmission
str,str,str,f64,str,str,str,str
"""D4135089""","""Tata Harrier X…","""168 bhp @ 3750…",16.3,"""1956 cc, 4 Cyl…","""350 Nm @ 1750 …","""Diesel""","""Manual"""
"""D4170587""","""BMW 3 Series G…","""184 bhp @ 4000…",19.59,"""1995 cc, 4 Cyl…","""380 Nm @ 1750 …","""Diesel""","""Automatic"""


In [481]:
# extracting the leading numeric value (the cc figure); entries without one, such as 'Not Applicable', become null
df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Engine').str.extract(r'^(\d+(?:\.\d+)?)', 1).cast(pl.Float64).alias('Displacement_(cc)'))

* deleting the Engine column

In [482]:
df_specs_B_Engine = df_specs_B_Engine.drop('Engine')

##### `Max Power (bhp@rpm) AND Max Torque (Nm@rpm)`

In [483]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Max Power (bhp@rpm)').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Engine.filter(pl.col('Max Power (bhp@rpm)')==0).shape[0],'\n',
        

        )

null 370 
 empty 0 
 space 0 
 0 (Str) 0 
 None 370 
 NA 0 



In [484]:
# first clearing out the leading and trailing spaces

df_specs_B_Engine = df_specs_B_Engine.with_columns(pl.col('Max Torque (Nm@rpm)').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)').is_null()).shape[0],'\n',
        'empty',df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)')=='').shape[0],'\n',
        'space', df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)')=='0').shape[0],'\n',
        'None',df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)')==None).shape[0],'\n',
        'NA',df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Engine.filter(pl.col('Max Torque (Nm@rpm)')==0).shape[0],'\n',
        

        )

null 370 
 empty 0 
 space 0 
 0 (Str) 0 
 None 370 
 NA 0 



* there are 370 null values in both torque and power
* splitting both torque and power into separate value and rpm columns

In [485]:
df_specs_B_Engine = df_specs_B_Engine.with_columns(
    pl.col('Max Torque (Nm@rpm)').str.split('@').list.to_struct(fields = ['Max_Torque_(Nm)','Max_Torque_at_rpm']).alias('Torque'),
    pl.col('Max Power (bhp@rpm)').str.split('@').list.to_struct(fields = ['Max_Power_(bhp)','Max_Power_at_rpm']).alias('Power')
).unnest('Torque','Power')


In [486]:
df_specs_B_Engine.head(2)

profileId,vehicle,Max Power (bhp@rpm),Mileage (ARAI),Max Torque (Nm@rpm),Fuel,transmission,Displacement_(cc),Max_Torque_(Nm),Max_Torque_at_rpm,Max_Power_(bhp),Max_Power_at_rpm
str,str,str,f64,str,str,str,f64,str,str,str,str
"""D4135089""","""Tata Harrier X…","""168 bhp @ 3750…",16.3,"""350 Nm @ 1750 …","""Diesel""","""Manual""",1956.0,"""350 Nm """,""" 1750 rpm""","""168 bhp """,""" 3750 rpm"""
"""D4170587""","""BMW 3 Series G…","""184 bhp @ 4000…",19.59,"""380 Nm @ 1750 …","""Diesel""","""Automatic""",1995.0,"""380 Nm """,""" 1750 rpm""","""184 bhp """,""" 4000 rpm"""


In [487]:
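# removing the unit, then prepending '0' (the empty regex matches at the start of the string)
# so that empty strings cast to 0.0, which is mapped back to null afterwards
df_specs_B_Engine = df_specs_B_Engine.with_columns(
    pl.col('Max_Torque_(Nm)').str.strip_chars().str.replace(' Nm','').str.replace('','0',literal=False).cast(pl.Float64).replace({0.0:None}),
    pl.col('Max_Power_(bhp)').str.strip_chars().str.replace(' bhp','').str.replace('','0',literal=False).cast(pl.Float64).replace({0.0:None})
)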

In [488]:
df_specs_B_Engine.head()

profileId,vehicle,Max Power (bhp@rpm),Mileage (ARAI),Max Torque (Nm@rpm),Fuel,transmission,Displacement_(cc),Max_Torque_(Nm),Max_Torque_at_rpm,Max_Power_(bhp),Max_Power_at_rpm
str,str,str,f64,str,str,str,f64,f64,str,f64,str
"""D4135089""","""Tata Harrier X…","""168 bhp @ 3750…",16.3,"""350 Nm @ 1750 …","""Diesel""","""Manual""",1956.0,350.0,""" 1750 rpm""",168.0,""" 3750 rpm"""
"""D4170587""","""BMW 3 Series G…","""184 bhp @ 4000…",19.59,"""380 Nm @ 1750 …","""Diesel""","""Automatic""",1995.0,380.0,""" 1750 rpm""",184.0,""" 4000 rpm"""
"""D4170559""","""Jeep Compass L…","""171 bhp @ 3750…",17.1,"""350 Nm @ 1750 …","""Diesel""","""Manual""",1956.0,350.0,""" 1750 rpm""",171.0,""" 3750 rpm"""
"""D4112409""","""Mercedes-Benz …","""197 bhp @ 3600…",,"""440 Nm @ 1800-…","""Diesel""","""Automatic""",1993.0,440.0,""" 1800-2800 rpm…",197.0,""" 3600 rpm"""
"""D4087571""","""Jeep Compass L…","""160 bhp @ 5500…",14.1,"""250 Nm @ 2500 …","""Petrol""","""Automatic""",1368.0,250.0,""" 2500 rpm""",160.0,""" 5500 rpm"""


In [489]:
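# keeping the first 4 characters of the rpm text (for a range such as '1800-2800 rpm' this is the lower bound),
# prepending '0' so that empty strings cast to 0.0, then mapping 0.0 back to null
df_specs_B_Engine = df_specs_B_Engine.with_columns(
    pl.col('Max_Torque_at_rpm').str.strip_chars().str.slice(0,4).str.replace('','0',literal=False).cast(pl.Float64).replace({0.0:None}),
    pl.col('Max_Power_at_rpm').str.strip_chars().str.slice(0,4).str.replace('','0',literal=False).cast(pl.Float64).replace({0.0:None})
)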

In [490]:
# removing the original combined torque and power columns

df_specs_B_Engine = df_specs_B_Engine.drop(['Max Torque (Nm@rpm)','Max Power (bhp@rpm)'])

##### FINAL ENGINE DATASET

In [491]:
df_specs_B_Engine.write_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\specs_Engine_B.csv')

#### `Dimensions & Weight`

In [567]:
#looking at the data

df_specs_B_Dimensions = df_specifications_B.filter(pl.col('spec_category')=='Dimensions & Weight')


In [568]:
df_specs_B_Dimensions.head()

specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Length""","""4395""","""mm""","""Dimensions & W…","""D4170559""","""Jeep Compass L…"
"""Width""","""1818""","""mm""","""Dimensions & W…","""D4170559""","""Jeep Compass L…"
"""Kerb Weight""","""1584""","""kg""","""Dimensions & W…","""D4170559""","""Jeep Compass L…"
"""Wheelbase""","""2865""","""mm""","""Dimensions & W…","""D4112409""","""Mercedes-Benz …"
"""Height""","""1447""","""mm""","""Dimensions & W…","""D4039375""","""Mercedes-Benz …"


##### PIVOTING TABLE

In [569]:
df_specs_B_Dimensions = df_specs_B_Dimensions.pivot(index = ['profileId','vehicle'],columns = 'specName', values = 'specValue')

In [570]:
df_specs_B_Dimensions.head()

profileId,vehicle,Length,Width,Kerb Weight,Wheelbase,Height,Ground Clearance
str,str,str,str,str,str,str,str
"""D4170559""","""Jeep Compass L…","""4395""","""1818""","""1584""","""2636""","""1640""",
"""D4112409""","""Mercedes-Benz …","""4751""","""1820""",,"""2865""","""1437""",
"""D4039375""","""Mercedes-Benz …","""4596""","""1770""","""1610""","""2960""","""1447""",
"""D4134657""","""Jeep Compass L…","""4395""","""1818""","""1584""","""2636""","""1640""",
"""D4056433""","""Jeep Compass L…","""4395""","""1818""","""1654""","""2636""","""1640""",


##### `Ground Clearance`

In [496]:
# first clearing out the leading and trailing spaces

df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Ground Clearance').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Dimensions.filter(pl.col('Ground Clearance').is_null()).shape[0],'\n',
        'empty',df_specs_B_Dimensions.filter(pl.col('Ground Clearance')=='').shape[0],'\n',
        'space', df_specs_B_Dimensions.filter(pl.col('Ground Clearance')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Dimensions.filter(pl.col('Ground Clearance')=='0').shape[0],'\n',
        'None',df_specs_B_Dimensions.filter(pl.col('Ground Clearance')==None).shape[0],'\n',
        'NA',df_specs_B_Dimensions.filter(pl.col('Ground Clearance')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Dimensions.filter(pl.col('Ground Clearance').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Dimensions.filter(pl.col('Ground Clearance')==0).shape[0],'\n',
        

        )

null 7437 
 empty 0 
 space 0 
 0 (Str) 0 
 None 7437 
 NA 0 



* a large number of null values can be seen (7,437 rows)
* we will still be using this column for analysis

In [571]:
list(df_specs_B_Dimensions['Ground Clearance'].unique())

['159',
 '202',
 None,
 '126',
 '215',
 '118',
 '154',
 '206',
 '221',
 '220',
 '184',
 '110',
 '186',
 '223',
 '227',
 '116',
 '192',
 '140',
 '172',
 '133',
 '112',
 '211',
 '145',
 '210',
 '139',
 '190',
 '185',
 '200',
 '165',
 '244',
 '175',
 '164',
 '114',
 '177',
 '181',
 '141',
 '218',
 '189',
 '176',
 '180',
 '187',
 '155',
 '295.5',
 '196',
 '157',
 '179',
 '204.8',
 '151',
 '163',
 '100',
 '217',
 '109',
 '225',
 '138',
 '128',
 '149',
 '168',
 '113',
 '161',
 '136',
 '135',
 '195',
 '158',
 '150',
 '170',
 '120',
 '137',
 '134',
 '142',
 '171',
 '117',
 '183',
 '167',
 '201',
 '204',
 '198',
 '230',
 '219',
 '188',
 '205',
 '208',
 '226',
 '174',
 '213',
 '209',
 '214',
 '144',
 '239.8',
 '156',
 '130',
 '216',
 '182',
 '197',
 '152',
 '238',
 '147',
 '212',
 '160']

In [572]:
# converting to float
df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Ground Clearance').cast(pl.Float64))

In [573]:
list(df_specs_B_Dimensions['Ground Clearance'].unique())

[None,
 100.0,
 109.0,
 110.0,
 112.0,
 113.0,
 114.0,
 116.0,
 117.0,
 118.0,
 120.0,
 126.0,
 128.0,
 130.0,
 133.0,
 134.0,
 135.0,
 136.0,
 137.0,
 138.0,
 139.0,
 140.0,
 141.0,
 142.0,
 144.0,
 145.0,
 147.0,
 149.0,
 150.0,
 151.0,
 152.0,
 154.0,
 155.0,
 156.0,
 157.0,
 158.0,
 159.0,
 160.0,
 161.0,
 163.0,
 164.0,
 165.0,
 167.0,
 168.0,
 170.0,
 171.0,
 172.0,
 174.0,
 175.0,
 176.0,
 177.0,
 179.0,
 180.0,
 181.0,
 182.0,
 183.0,
 184.0,
 185.0,
 186.0,
 187.0,
 188.0,
 189.0,
 190.0,
 192.0,
 195.0,
 196.0,
 197.0,
 198.0,
 200.0,
 201.0,
 202.0,
 204.0,
 204.8,
 205.0,
 206.0,
 208.0,
 209.0,
 210.0,
 211.0,
 212.0,
 213.0,
 214.0,
 215.0,
 216.0,
 217.0,
 218.0,
 219.0,
 220.0,
 221.0,
 223.0,
 225.0,
 226.0,
 227.0,
 230.0,
 238.0,
 239.8,
 244.0,
 295.5]

##### `Height`

In [574]:
# first clearing out the leading and trailing spaces

df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Height').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Dimensions.filter(pl.col('Height').is_null()).shape[0],'\n',
        'empty',df_specs_B_Dimensions.filter(pl.col('Height')=='').shape[0],'\n',
        'space', df_specs_B_Dimensions.filter(pl.col('Height')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Dimensions.filter(pl.col('Height')=='0').shape[0],'\n',
        'None',df_specs_B_Dimensions.filter(pl.col('Height')==None).shape[0],'\n',
        'NA',df_specs_B_Dimensions.filter(pl.col('Height')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Dimensions.filter(pl.col('Height').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Dimensions.filter(pl.col('Height')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



* no null values

In [575]:
# converting to float
df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Height').cast(pl.Float64))

In [576]:
list(df_specs_B_Dimensions['Height'].unique())

[1165.0,
 1184.0,
 1240.0,
 1241.0,
 1252.0,
 1269.0,
 1281.0,
 1282.0,
 1291.0,
 1292.0,
 1295.0,
 1297.0,
 1298.0,
 1301.0,
 1303.0,
 1304.0,
 1308.0,
 1315.0,
 1322.0,
 1329.0,
 1353.0,
 1360.0,
 1366.0,
 1369.0,
 1370.0,
 1377.0,
 1382.0,
 1383.0,
 1386.0,
 1390.0,
 1391.0,
 1392.0,
 1393.0,
 1395.0,
 1398.0,
 1401.0,
 1402.0,
 1403.0,
 1404.0,
 1405.0,
 1407.0,
 1409.0,
 1410.0,
 1411.0,
 1412.0,
 1414.0,
 1415.0,
 1416.0,
 1417.0,
 1418.0,
 1420.0,
 1421.0,
 1422.0,
 1423.0,
 1425.0,
 1426.0,
 1427.0,
 1429.0,
 1431.0,
 1432.0,
 1433.0,
 1434.0,
 1435.0,
 1437.0,
 1439.0,
 1440.0,
 1441.0,
 1442.0,
 1443.0,
 1445.0,
 1446.0,
 1447.0,
 1448.0,
 1450.0,
 1452.0,
 1453.0,
 1455.0,
 1456.0,
 1457.0,
 1458.0,
 1459.0,
 1460.0,
 1461.0,
 1464.0,
 1465.0,
 1466.0,
 1467.0,
 1468.0,
 1469.0,
 1470.0,
 1471.0,
 1472.0,
 1473.0,
 1474.0,
 1475.0,
 1476.0,
 1477.0,
 1478.0,
 1479.0,
 1480.0,
 1481.0,
 1482.0,
 1483.0,
 1484.0,
 1485.0,
 1486.0,
 1487.0,
 1488.0,
 1489.0,
 1490.0,
 1494.0,
 

##### `Kerb Weight`

In [577]:
# first clearing out the leading and trailing spaces

df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Kerb Weight').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Dimensions.filter(pl.col('Kerb Weight').is_null()).shape[0],'\n',
        'empty',df_specs_B_Dimensions.filter(pl.col('Kerb Weight')=='').shape[0],'\n',
        'space', df_specs_B_Dimensions.filter(pl.col('Kerb Weight')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Dimensions.filter(pl.col('Kerb Weight')=='0').shape[0],'\n',
        'None',df_specs_B_Dimensions.filter(pl.col('Kerb Weight')==None).shape[0],'\n',
        'NA',df_specs_B_Dimensions.filter(pl.col('Kerb Weight')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Dimensions.filter(pl.col('Kerb Weight').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Dimensions.filter(pl.col('Kerb Weight')==0).shape[0],'\n',
        

        )

null 10891 
 empty 0 
 space 0 
 0 (Str) 0 
 None 10891 
 NA 0 



* a large number of null values (10,891 rows)

In [578]:
list(df_specs_B_Dimensions['Kerb Weight'].unique())

['1296',
 '2583',
 '1760',
 '2261',
 '1366',
 '859',
 '1312',
 '1710',
 '1580',
 '1567',
 '1027',
 '2230',
 '1755',
 '1750',
 '1261',
 '1041',
 '1092',
 '879',
 '2330',
 '2725',
 '1164',
 '1216',
 '2177',
 '1275',
 '1575',
 '1152',
 '975',
 '928',
 '1850',
 '2340',
 '1065',
 '678',
 '1335',
 '1148',
 '970',
 '845',
 '812',
 '1187',
 '2113',
 '1345',
 '1026',
 '1460',
 '1125',
 '2505',
 '1213',
 '965',
 '1097',
 '767',
 '932',
 '1976',
 '1545',
 '1380',
 '1530',
 '1173',
 '1876',
 '737',
 '1215',
 '2345',
 '1301',
 '1161',
 '1003',
 '784',
 '1608',
 '1572',
 '1049',
 '1231',
 '1614',
 '920',
 '727',
 '846',
 '1835',
 '764',
 '1106',
 '1868',
 '912',
 '762',
 '1023',
 '2200',
 '1150',
 '1302',
 '1029',
 '1884',
 '923',
 '1090',
 '1930',
 '1325',
 '831',
 '1965',
 '2360',
 '1995',
 '2065',
 '1562',
 '1176',
 '1332',
 '1705',
 '2132',
 '1266',
 '1008',
 '1891',
 '1016',
 '815',
 '1945',
 '722',
 '1242',
 '1816',
 '2155',
 '1061',
 '1719',
 '2186',
 '837',
 '1830',
 '2097',
 '1195',
 '2666'

In [579]:
# converting to float
df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Kerb Weight').cast(pl.Float64))

In [580]:
list(df_specs_B_Dimensions['Kerb Weight'].unique())

[None,
 214.0,
 615.0,
 635.0,
 650.0,
 660.0,
 665.0,
 669.0,
 678.0,
 705.0,
 715.0,
 720.0,
 722.0,
 725.0,
 727.0,
 732.0,
 735.0,
 737.0,
 740.0,
 745.0,
 750.0,
 755.0,
 757.0,
 758.0,
 760.0,
 761.0,
 762.0,
 763.0,
 764.0,
 765.0,
 767.0,
 769.0,
 774.0,
 779.0,
 784.0,
 785.0,
 795.0,
 800.0,
 805.0,
 810.0,
 812.0,
 815.0,
 820.0,
 823.0,
 825.0,
 827.0,
 830.0,
 831.0,
 833.0,
 834.0,
 835.0,
 837.0,
 838.0,
 839.0,
 840.0,
 841.0,
 842.0,
 844.0,
 845.0,
 846.0,
 848.0,
 849.0,
 850.0,
 854.0,
 855.0,
 858.0,
 859.0,
 860.0,
 865.0,
 866.0,
 870.0,
 875.0,
 876.0,
 879.0,
 880.0,
 882.0,
 885.0,
 886.0,
 888.0,
 890.0,
 891.0,
 892.0,
 895.0,
 896.0,
 898.0,
 900.0,
 904.0,
 905.0,
 908.0,
 910.0,
 912.0,
 913.0,
 915.0,
 917.0,
 920.0,
 921.0,
 923.0,
 924.0,
 925.0,
 928.0,
 930.0,
 932.0,
 935.0,
 939.0,
 940.0,
 942.0,
 945.0,
 947.0,
 950.0,
 955.0,
 957.0,
 960.0,
 963.0,
 965.0,
 970.0,
 971.0,
 975.0,
 978.0,
 979.0,
 980.0,
 982.0,
 985.0,
 987.0,
 990.0,
 992.0,
 

##### `Length`

In [581]:
# first clearing out the leading and trailing spaces

df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Length').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Dimensions.filter(pl.col('Length').is_null()).shape[0],'\n',
        'empty',df_specs_B_Dimensions.filter(pl.col('Length')=='').shape[0],'\n',
        'space', df_specs_B_Dimensions.filter(pl.col('Length')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Dimensions.filter(pl.col('Length')=='0').shape[0],'\n',
        'None',df_specs_B_Dimensions.filter(pl.col('Length')==None).shape[0],'\n',
        'NA',df_specs_B_Dimensions.filter(pl.col('Length')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Dimensions.filter(pl.col('Length').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Dimensions.filter(pl.col('Length')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [510]:
list(df_specs_B_Dimensions['Length'].unique())

['4657',
 '4247',
 '4735',
 '4436',
 '3805',
 '4561',
 '4234',
 '4310',
 '4658',
 '4559',
 '4160',
 '3821',
 '4540',
 '4592',
 '3994',
 '4435',
 '5267',
 '4270',
 '3810',
 '4751',
 '4715',
 '5118',
 '3795',
 '4365',
 '4655',
 '4780',
 '4177',
 '4958',
 '3700',
 '4440',
 '4723',
 '5325',
 '4781',
 '4395',
 '4818',
 '3886',
 '4871',
 '4726',
 '4830',
 '4810',
 '4833',
 '4528',
 '3775',
 '5192',
 '5076',
 '3446',
 '5067',
 '4032',
 '3590',
 '4525',
 '4420',
 '4486',
 '4180',
 '4760',
 '4630',
 '5246',
 '3370',
 '4689',
 '4690',
 '3993',
 '4933',
 '4985',
 '5252',
 '4430',
 '4828',
 '4885',
 '4868',
 '3599',
 '4249',
 '4299',
 '4329',
 '4955',
 '4628',
 '4688',
 '5295',
 '4370',
 '4325',
 '4806',
 '5075',
 '5091',
 '4702',
 '3895',
 '5370',
 '4423',
 '5207',
 '4797',
 '4531',
 '5399',
 '4498',
 '4747',
 '4535',
 '5052',
 '3530',
 '4869',
 '3099',
 '5235',
 '4892',
 '4380',
 '4509',
 '3545',
 '4116',
 '3955',
 '4708',
 '4375',
 '4549',
 '5063',
 '4393',
 '4221',
 '4569',
 '4767',
 '4846',
 

In [582]:
# converting to float
df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Length').cast(pl.Float64))

In [583]:
list(df_specs_B_Dimensions['Length'].unique())

[3099.0,
 3164.0,
 3280.0,
 3335.0,
 3370.0,
 3395.0,
 3429.0,
 3430.0,
 3435.0,
 3445.0,
 3446.0,
 3495.0,
 3500.0,
 3515.0,
 3520.0,
 3530.0,
 3535.0,
 3539.0,
 3545.0,
 3565.0,
 3585.0,
 3590.0,
 3595.0,
 3599.0,
 3600.0,
 3610.0,
 3620.0,
 3636.0,
 3640.0,
 3655.0,
 3675.0,
 3679.0,
 3690.0,
 3695.0,
 3700.0,
 3715.0,
 3723.0,
 3729.0,
 3731.0,
 3746.0,
 3760.0,
 3765.0,
 3775.0,
 3780.0,
 3785.0,
 3788.0,
 3793.0,
 3795.0,
 3801.0,
 3802.0,
 3805.0,
 3810.0,
 3815.0,
 3821.0,
 3825.0,
 3827.0,
 3840.0,
 3845.0,
 3850.0,
 3880.0,
 3884.0,
 3886.0,
 3895.0,
 3900.0,
 3920.0,
 3940.0,
 3941.0,
 3946.0,
 3954.0,
 3955.0,
 3970.0,
 3971.0,
 3976.0,
 3981.0,
 3982.0,
 3985.0,
 3987.0,
 3988.0,
 3989.0,
 3990.0,
 3991.0,
 3992.0,
 3993.0,
 3994.0,
 3995.0,
 3998.0,
 3999.0,
 4000.0,
 4010.0,
 4018.0,
 4032.0,
 4056.0,
 4082.0,
 4095.0,
 4097.0,
 4103.0,
 4107.0,
 4110.0,
 4116.0,
 4129.0,
 4134.0,
 4142.0,
 4145.0,
 4150.0,
 4160.0,
 4177.0,
 4180.0,
 4198.0,
 4200.0,
 4221.0,
 4222.0,
 

##### `Wheelbase`

In [584]:
# first clearing out the leading and trailing spaces

df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Wheelbase').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Dimensions.filter(pl.col('Wheelbase').is_null()).shape[0],'\n',
        'empty',df_specs_B_Dimensions.filter(pl.col('Wheelbase')=='').shape[0],'\n',
        'space', df_specs_B_Dimensions.filter(pl.col('Wheelbase')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Dimensions.filter(pl.col('Wheelbase')=='0').shape[0],'\n',
        'None',df_specs_B_Dimensions.filter(pl.col('Wheelbase')==None).shape[0],'\n',
        'NA',df_specs_B_Dimensions.filter(pl.col('Wheelbase')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Dimensions.filter(pl.col('Wheelbase').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Dimensions.filter(pl.col('Wheelbase')==0).shape[0],'\n',
        

        )

null 3 
 empty 0 
 space 0 
 0 (Str) 0 
 None 3 
 NA 0 



In [516]:
list(df_specs_B_Dimensions['Wheelbase'].unique())

['3022',
 '2843',
 '2638',
 '2505',
 '2786',
 '3002',
 '2462',
 '2990',
 '2693',
 '2435',
 '3025',
 '2455',
 '3105',
 '3075',
 '3003',
 '2496',
 '2916',
 '2806',
 '3095',
 '2489',
 '2350',
 '2502',
 '2775',
 '3050',
 '2490',
 '2730',
 '2622',
 '2780',
 '2765',
 '2729',
 '2650',
 '2648',
 '2841',
 '2720',
 '1840',
 '2469',
 '3216',
 '2794',
 '3430',
 '2703',
 '3295',
 '2895',
 '3074',
 '2360',
 '2501',
 '3197',
 '3065',
 '2380',
 '2829',
 '2560',
 '2673',
 '2711',
 '2575',
 '3070',
 None,
 '2774',
 '2567',
 '2705',
 '2920',
 '2400',
 '2390',
 '3200',
 '2914',
 '2939',
 '2933',
 '2633',
 '2519',
 '2851',
 '2430',
 '2865',
 '2373',
 '2947',
 '2652',
 '2850',
 '2845',
 '2677',
 '2787',
 '2660',
 '2702',
 '2725',
 '2824',
 '3125',
 '2646',
 '2935',
 '2857',
 '2486',
 '2873',
 '2475',
 '2997',
 '2610',
 '3194',
 '2385',
 '2468',
 '2808',
 '2530',
 '2961',
 '2603',
 '2500',
 '2640',
 '2950',
 '2580',
 '2964',
 '2782',
 '2550',
 '2375',
 '3365',
 '2443',
 '2405',
 '2692',
 '2425',
 '2457',
 '2

In [585]:
# converting to float
df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Wheelbase').cast(pl.Float64))

In [586]:
list(df_specs_B_Dimensions['Wheelbase'].unique())

[None,
 1840.0,
 1958.0,
 2175.0,
 2230.0,
 2258.0,
 2300.0,
 2335.0,
 2345.0,
 2348.0,
 2350.0,
 2360.0,
 2365.0,
 2373.0,
 2375.0,
 2380.0,
 2385.0,
 2390.0,
 2400.0,
 2405.0,
 2422.0,
 2425.0,
 2430.0,
 2435.0,
 2440.0,
 2443.0,
 2445.0,
 2450.0,
 2455.0,
 2456.0,
 2457.0,
 2460.0,
 2462.0,
 2464.0,
 2465.0,
 2467.0,
 2468.0,
 2469.0,
 2470.0,
 2475.0,
 2480.0,
 2486.0,
 2489.0,
 2490.0,
 2491.0,
 2495.0,
 2496.0,
 2498.0,
 2499.0,
 2500.0,
 2501.0,
 2502.0,
 2505.0,
 2510.0,
 2512.0,
 2519.0,
 2520.0,
 2524.0,
 2525.0,
 2530.0,
 2540.0,
 2550.0,
 2552.0,
 2553.0,
 2555.0,
 2560.0,
 2567.0,
 2570.0,
 2575.0,
 2578.0,
 2580.0,
 2585.0,
 2587.0,
 2590.0,
 2593.0,
 2595.0,
 2600.0,
 2603.0,
 2610.0,
 2620.0,
 2622.0,
 2630.0,
 2633.0,
 2636.0,
 2637.0,
 2638.0,
 2640.0,
 2646.0,
 2648.0,
 2650.0,
 2651.0,
 2652.0,
 2660.0,
 2662.0,
 2670.0,
 2673.0,
 2677.0,
 2679.0,
 2680.0,
 2681.0,
 2685.0,
 2688.0,
 2690.0,
 2692.0,
 2693.0,
 2699.0,
 2700.0,
 2702.0,
 2703.0,
 2705.0,
 2709.0,
 27

##### `Width`

In [587]:
# first clearing out the leading and trailing spaces

df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Width').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Dimensions.filter(pl.col('Width').is_null()).shape[0],'\n',
        'empty',df_specs_B_Dimensions.filter(pl.col('Width')=='').shape[0],'\n',
        'space', df_specs_B_Dimensions.filter(pl.col('Width')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Dimensions.filter(pl.col('Width')=='0').shape[0],'\n',
        'None',df_specs_B_Dimensions.filter(pl.col('Width')==None).shape[0],'\n',
        'NA',df_specs_B_Dimensions.filter(pl.col('Width')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Dimensions.filter(pl.col('Width').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Dimensions.filter(pl.col('Width')==0).shape[0],'\n',
        

        )

null 0 
 empty 0 
 space 0 
 0 (Str) 0 
 None 0 
 NA 0 



In [588]:
list(df_specs_B_Dimensions['Width'].unique())

['1785',
 '2155',
 '2096',
 '1739',
 '1821',
 '1645',
 '1680',
 '1700',
 '1970',
 '1790',
 '2025',
 '1829',
 '1475',
 '1815',
 '1560',
 '1640',
 '1849',
 '1804',
 '1920',
 '1830',
 '1913',
 '1980',
 '1826',
 '1647',
 '2209',
 '1750',
 '1729',
 '1898',
 '1996',
 '1805',
 '1726',
 '1764',
 '1758',
 '2132',
 '1520',
 '1943',
 '2080',
 '2118',
 '1798',
 '1595',
 '1911',
 '1801',
 '1687',
 '1733',
 '1927',
 '2008',
 '1831',
 '1710',
 '1922',
 '1682',
 '2105',
 '1808',
 '1793',
 '1720',
 '2003',
 '2071',
 '1930',
 '2041',
 '1694',
 '1843',
 '1662',
 '1410',
 '1887',
 '2139',
 '1871',
 '1861',
 '1880',
 '1777',
 '1683',
 '1515',
 '2075',
 '1949',
 '1495',
 '2000',
 '1965',
 '1852',
 '1655',
 '1822',
 '2094',
 '1934',
 '1760',
 '1814',
 '2110',
 '1818',
 '2028',
 '1936',
 '1928',
 '1788',
 '1895',
 '2031',
 '1742',
 '1620',
 '1721',
 '1665',
 '1812',
 '1969',
 '1820',
 '1832',
 '1989',
 '2044',
 '1575',
 '1874',
 '1810',
 '2220',
 '2183',
 '1891',
 '1918',
 '2040',
 '1795',
 '1727',
 '1778',
 

In [589]:
# converting to float
df_specs_B_Dimensions = df_specs_B_Dimensions.with_columns(pl.col('Width').cast(pl.Float64))

In [590]:
list(df_specs_B_Dimensions['Width'].unique())

[1410.0,
 1440.0,
 1475.0,
 1485.0,
 1490.0,
 1495.0,
 1500.0,
 1514.0,
 1515.0,
 1520.0,
 1525.0,
 1540.0,
 1550.0,
 1560.0,
 1574.0,
 1575.0,
 1579.0,
 1595.0,
 1600.0,
 1608.0,
 1620.0,
 1627.0,
 1632.0,
 1635.0,
 1636.0,
 1640.0,
 1642.0,
 1645.0,
 1647.0,
 1655.0,
 1658.0,
 1660.0,
 1662.0,
 1665.0,
 1670.0,
 1677.0,
 1680.0,
 1682.0,
 1683.0,
 1686.0,
 1687.0,
 1690.0,
 1694.0,
 1695.0,
 1698.0,
 1699.0,
 1700.0,
 1703.0,
 1704.0,
 1705.0,
 1706.0,
 1710.0,
 1715.0,
 1720.0,
 1721.0,
 1722.0,
 1725.0,
 1726.0,
 1727.0,
 1728.0,
 1729.0,
 1730.0,
 1731.0,
 1733.0,
 1734.0,
 1735.0,
 1737.0,
 1739.0,
 1740.0,
 1742.0,
 1745.0,
 1748.0,
 1750.0,
 1751.0,
 1752.0,
 1755.0,
 1758.0,
 1760.0,
 1764.0,
 1765.0,
 1767.0,
 1769.0,
 1770.0,
 1772.0,
 1775.0,
 1777.0,
 1778.0,
 1780.0,
 1781.0,
 1783.0,
 1785.0,
 1786.0,
 1788.0,
 1789.0,
 1790.0,
 1793.0,
 1795.0,
 1796.0,
 1798.0,
 1799.0,
 1800.0,
 1801.0,
 1804.0,
 1805.0,
 1808.0,
 1809.0,
 1810.0,
 1811.0,
 1812.0,
 1813.0,
 1814.0,
 

In [592]:
df_specs_B_Dimensions.describe()

describe,profileId,vehicle,Length,Width,Kerb Weight,Wheelbase,Height,Ground Clearance
str,str,str,f64,f64,f64,f64,f64,f64
"""count""","""30201""","""30201""",30201.0,30201.0,30201.0,30201.0,30201.0,30201.0
"""null_count""","""0""","""0""",0.0,0.0,10891.0,3.0,0.0,7437.0
"""mean""",,,4232.142512,1755.005596,1284.578975,2602.1868,1587.576107,178.582784
"""std""",,,436.143498,126.220848,409.958157,183.793675,123.161615,19.216126
"""min""","""D1982769""","""Aston Martin R…",3099.0,1410.0,214.0,1840.0,1165.0,100.0
"""25%""",,,3971.0,1695.0,965.0,2450.0,1495.0,165.0
"""50%""",,,4270.0,1745.0,1160.0,2570.0,1545.0,170.0
"""75%""",,,4585.0,1822.0,1585.0,2740.0,1665.0,190.0
"""max""","""S2805517""","""Volvo XC90 Mom…",5569.0,2220.0,2962.0,3465.0,2100.0,295.5


##### FINAL DIMENSIONS DATASET

In [593]:
df_specs_B_Dimensions.head()

profileId,vehicle,Length,Width,Kerb Weight,Wheelbase,Height,Ground Clearance
str,str,f64,f64,f64,f64,f64,f64
"""D4170559""","""Jeep Compass L…",4395.0,1818.0,1584.0,2636.0,1640.0,
"""D4112409""","""Mercedes-Benz …",4751.0,1820.0,,2865.0,1437.0,
"""D4039375""","""Mercedes-Benz …",4596.0,1770.0,1610.0,2960.0,1447.0,
"""D4134657""","""Jeep Compass L…",4395.0,1818.0,1584.0,2636.0,1640.0,
"""D4056433""","""Jeep Compass L…",4395.0,1818.0,1654.0,2636.0,1640.0,


In [594]:
df_specs_B_Dimensions.columns

['profileId',
 'vehicle',
 'Length',
 'Width',
 'Kerb Weight',
 'Wheelbase',
 'Height',
 'Ground Clearance']

In [595]:
df_specs_B_Dimensions = df_specs_B_Dimensions.rename({
                                                     'Ground Clearance': 'Ground_Clearance_(mm)',
                                                     'Height': 'Height_(mm)',
                                                     'Kerb Weight': 'Kerb_Weight_(kg)',
                                                     'Length': 'Length_(mm)',
                                                     'Wheelbase': 'Wheelbase_(mm)',
                                                     'Width': 'Width_(mm)'}
                                                    )

In [596]:
df_specs_B_Dimensions.head()

profileId,vehicle,Length_(mm),Width_(mm),Kerb_Weight_(kg),Wheelbase_(mm),Height_(mm),Ground_Clearance_(mm)
str,str,f64,f64,f64,f64,f64,f64
"""D4170559""","""Jeep Compass L…",4395.0,1818.0,1584.0,2636.0,1640.0,
"""D4112409""","""Mercedes-Benz …",4751.0,1820.0,,2865.0,1437.0,
"""D4039375""","""Mercedes-Benz …",4596.0,1770.0,1610.0,2960.0,1447.0,
"""D4134657""","""Jeep Compass L…",4395.0,1818.0,1584.0,2636.0,1640.0,
"""D4056433""","""Jeep Compass L…",4395.0,1818.0,1654.0,2636.0,1640.0,


In [597]:
df_specs_B_Dimensions.write_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\specs_Dimensions_B.csv')

#### `Capacity`

In [598]:
# filtering the 'Capacity' specifications

df_specs_B_Capacity = df_specifications_B.filter(pl.col('spec_category')=='Capacity')

In [599]:
df_specs_B_Capacity.head()

specName,specValue,specUnit,spec_category,profileId,vehicle
str,str,str,str,str,str
"""Fuel Tank Capa…","""50""","""litres""","""Capacity""","""D4135089""","""Tata Harrier X…"
"""Doors""","""4""","""Doors""","""Capacity""","""D4170587""","""BMW 3 Series G…"
"""Doors""","""5""","""Doors""","""Capacity""","""D4170559""","""Jeep Compass L…"
"""Seating Capaci…","""5""","""Person""","""Capacity""","""D4170559""","""Jeep Compass L…"
"""No of Seating …","""2""","""Rows""","""Capacity""","""D4170559""","""Jeep Compass L…"


In [600]:
df_specs_B_Capacity[['specUnit','specName']].unique()

specUnit,specName
str,str
"""litres""","""Fuel Tank Capa…"
"""Doors""","""Doors"""
"""litres""","""Bootspace"""
"""Person""","""Seating Capaci…"
"""Rows""","""No of Seating …"


##### PIVOTING TABLE

In [601]:
df_specs_B_Capacity = df_specs_B_Capacity.pivot(index = ['profileId','vehicle'],columns = 'specName', values = 'specValue')

In [602]:
df_specs_B_Capacity.head()

profileId,vehicle,Fuel Tank Capacity,Doors,Seating Capacity,No of Seating Rows,Bootspace
str,str,str,str,str,str,str
"""D4135089""","""Tata Harrier X…","""50""","""5""","""5""","""2""","""425"""
"""D4170587""","""BMW 3 Series G…",,"""4""","""5""","""2""",
"""D4170559""","""Jeep Compass L…","""60""","""5""","""5""","""2""","""438"""
"""D4112409""","""Mercedes-Benz …","""66""","""4""","""5""","""2""",
"""D4039319""","""Mercedes-Benz …","""66""","""4""","""4""","""2""","""435"""


##### `Fuel Tank Capacity`

In [603]:
# first clearing out the leading and trailing spaces

df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Fuel Tank Capacity').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity').is_null()).shape[0],'\n',
        'empty',df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity')=='').shape[0],'\n',
        'space', df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity')=='0').shape[0],'\n',
        'None',df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity')==None).shape[0],'\n',
        'NA',df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Capacity.filter(pl.col('Fuel Tank Capacity')==0).shape[0],'\n',
        

        )

null 674 
 empty 0 
 space 0 
 0 (Str) 0 
 None 674 
 NA 0 



* 674 null values

In [604]:
list(df_specs_B_Capacity['Fuel Tank Capacity'].unique())

['90.5',
 '87',
 '57',
 '50',
 '58',
 '37',
 '26',
 '66',
 '60.9',
 '48',
 '86',
 '93',
 '85',
 '28',
 '59',
 '110',
 '72',
 '27',
 '47',
 '52.5',
 '63',
 '93.5',
 '15',
 '68',
 '92',
 '88.5',
 '36',
 '62',
 '105',
 '54',
 '83',
 '76',
 '64',
 '70.6',
 '81',
 '60',
 '138',
 '45',
 '90',
 '96',
 None,
 '71',
 '35',
 '89',
 '67.5',
 '51',
 '44',
 '74',
 '30',
 '43',
 '41',
 '82.5',
 '38',
 '61',
 '40',
 '52',
 '77',
 '80',
 '56',
 '53',
 '55',
 '26.2',
 '104',
 '75',
 '66.5',
 '24',
 '67',
 '42',
 '95',
 '70',
 '73',
 '65',
 '100',
 '32',
 '78',
 '82']

In [503]:
# converting to float
df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Fuel Tank Capacity').cast(pl.Float64))

In [504]:
list(df_specs_B_Capacity['Fuel Tank Capacity'].unique())


##### `Doors`

In [605]:
# first clearing out the leading and trailing spaces

df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Doors').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Capacity.filter(pl.col('Doors').is_null()).shape[0],'\n',
        'empty',df_specs_B_Capacity.filter(pl.col('Doors')=='').shape[0],'\n',
        'space', df_specs_B_Capacity.filter(pl.col('Doors')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Capacity.filter(pl.col('Doors')=='0').shape[0],'\n',
        'None',df_specs_B_Capacity.filter(pl.col('Doors')==None).shape[0],'\n',
        'NA',df_specs_B_Capacity.filter(pl.col('Doors')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Capacity.filter(pl.col('Doors').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Capacity.filter(pl.col('Doors')==0).shape[0],'\n',
        

        )

null 9 
 empty 0 
 space 0 
 0 (Str) 0 
 None 9 
 NA 0 



In [606]:
# converting to float
df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Doors').cast(pl.Float64))

In [607]:
list(df_specs_B_Capacity['Doors'].unique())

[None, 2.0, 3.0, 4.0, 5.0]

##### `Seating Capacity`

In [608]:
# first clearing out the leading and trailing spaces

df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Seating Capacity').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Capacity.filter(pl.col('Seating Capacity').is_null()).shape[0],'\n',
        'empty',df_specs_B_Capacity.filter(pl.col('Seating Capacity')=='').shape[0],'\n',
        'space', df_specs_B_Capacity.filter(pl.col('Seating Capacity')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Capacity.filter(pl.col('Seating Capacity')=='0').shape[0],'\n',
        'None',df_specs_B_Capacity.filter(pl.col('Seating Capacity')==None).shape[0],'\n',
        'NA',df_specs_B_Capacity.filter(pl.col('Seating Capacity')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Capacity.filter(pl.col('Seating Capacity').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Capacity.filter(pl.col('Seating Capacity')==0).shape[0],'\n',
        

        )

null 3 
 empty 0 
 space 0 
 0 (Str) 0 
 None 3 
 NA 0 



In [609]:
list(df_specs_B_Capacity['Seating Capacity'].unique())

['8', '9', '7 & 8', None, '7 & 9', '4', '6', '2', '7', '5', '10']

In [610]:
# entries listing two seating options ('7 & 8', '7 & 9') are reduced to the lower value, 7
df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Seating Capacity').str.replace("7 & 9","7").str.replace("7 & 8","7"))

In [611]:
list(df_specs_B_Capacity['Seating Capacity'].unique())

[None, '7', '4', '6', '10', '2', '9', '8', '5']

In [612]:
# converting to float
df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Seating Capacity').cast(pl.Float64))

In [613]:
list(df_specs_B_Capacity['Seating Capacity'].unique())

[None, 2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]

##### `No of Seating Rows`

In [614]:
# first clearing out the leading and trailing spaces

df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('No of Seating Rows').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Capacity.filter(pl.col('No of Seating Rows').is_null()).shape[0],'\n',
        'empty',df_specs_B_Capacity.filter(pl.col('No of Seating Rows')=='').shape[0],'\n',
        'space', df_specs_B_Capacity.filter(pl.col('No of Seating Rows')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Capacity.filter(pl.col('No of Seating Rows')=='0').shape[0],'\n',
        'None',df_specs_B_Capacity.filter(pl.col('No of Seating Rows')==None).shape[0],'\n',
        'NA',df_specs_B_Capacity.filter(pl.col('No of Seating Rows')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Capacity.filter(pl.col('No of Seating Rows').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Capacity.filter(pl.col('No of Seating Rows')==0).shape[0],'\n',
        

        )

null 2704 
 empty 0 
 space 0 
 0 (Str) 0 
 None 2704 
 NA 0 



* a large number of null values (2,704 rows)

In [615]:
list(df_specs_B_Capacity['No of Seating Rows'].unique())

[None, '2', '3', '1', '4']

In [616]:
# converting to float
df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('No of Seating Rows').cast(pl.Float64))

In [617]:
list(df_specs_B_Capacity['No of Seating Rows'].unique())

[None, 1.0, 2.0, 3.0, 4.0]

##### `Bootspace`

In [618]:
# first clearing out the leading and trailing spaces

df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Bootspace').str.strip_chars())

# checking inconsistencies and null values

print(

        #string
        'null', df_specs_B_Capacity.filter(pl.col('Bootspace').is_null()).shape[0],'\n',
        'empty',df_specs_B_Capacity.filter(pl.col('Bootspace')=='').shape[0],'\n',
        'space', df_specs_B_Capacity.filter(pl.col('Bootspace')==' ').shape[0],'\n',
        '0 (Str)',df_specs_B_Capacity.filter(pl.col('Bootspace')=='0').shape[0],'\n',
        'None',df_specs_B_Capacity.filter(pl.col('Bootspace')==None).shape[0],'\n',
        'NA',df_specs_B_Capacity.filter(pl.col('Bootspace')=='Not Available').shape[0],'\n',
    
        #numeric
        #'nan', df_specs_B_Capacity.filter(pl.col('Bootspace').is_nan()).shape[0],'\n',
        #'0', df_specs_B_Capacity.filter(pl.col('Bootspace')==0).shape[0],'\n',
        

        )

null 7805 
 empty 0 
 space 0 
 0 (Str) 0 
 None 7805 
 NA 0 



* a large number of null values (7,805 rows)

In [619]:
list(df_specs_B_Capacity['Bootspace'].unique())

['525',
 '382',
 '214',
 '707',
 '440',
 '330',
 '295',
 '969',
 '605',
 '352',
 '392',
 '360',
 '521',
 '258',
 '897',
 '486',
 '410',
 '212',
 '1050',
 '400',
 '540',
 '346',
 '175',
 '454',
 '154',
 '194',
 '128',
 '243',
 '480',
 '390',
 '308',
 '479',
 '343',
 '513',
 '370',
 '210',
 '460',
 '508',
 '506',
 '373',
 '500',
 '268',
 '300',
 '565',
 '556',
 '438',
 '251',
 '645',
 '278',
 '472',
 '461',
 '209',
 '125',
 '222',
 '425',
 '592',
 '326',
 '447',
 '230',
 '234',
 '285',
 '354',
 '465',
 '359',
 '366',
 '510',
 '600',
 '586',
 '515',
 '580',
 '211',
 '391',
 '448',
 '371',
 '242',
 '208',
 '625',
 '355',
 '570',
 '363',
 '320',
 '490',
 '180',
 '260',
 '341',
 '94',
 '93',
 '279',
 '232',
 '80',
 '347',
 '177',
 '135',
 '257',
 '433',
 '453',
 '248',
 '857',
 '270',
 '215',
 '420',
 '358',
 '493',
 '332',
 '236',
 '432',
 '1400',
 '218',
 '755',
 '185',
 '235',
 '615',
 '623',
 '421',
 '290',
 '495',
 '884',
 '339',
 '280',
 '483',
 '590',
 '435',
 '378',
 '740',
 '680',
 

In [620]:
# converting to float
df_specs_B_Capacity = df_specs_B_Capacity.with_columns(pl.col('Bootspace').cast(pl.Float64))

In [621]:
list(df_specs_B_Capacity['Bootspace'].unique())

[None,
 80.0,
 84.0,
 93.0,
 94.0,
 110.0,
 122.0,
 125.0,
 128.0,
 135.0,
 154.0,
 155.0,
 170.0,
 174.0,
 175.0,
 177.0,
 180.0,
 185.0,
 190.0,
 194.0,
 205.0,
 207.0,
 208.0,
 209.0,
 210.0,
 211.0,
 212.0,
 214.0,
 215.0,
 216.0,
 218.0,
 220.0,
 222.0,
 223.0,
 225.0,
 230.0,
 232.0,
 234.0,
 235.0,
 236.0,
 240.0,
 242.0,
 243.0,
 248.0,
 251.0,
 256.0,
 257.0,
 258.0,
 260.0,
 265.0,
 268.0,
 270.0,
 278.0,
 279.0,
 280.0,
 284.0,
 285.0,
 290.0,
 295.0,
 296.0,
 297.0,
 300.0,
 308.0,
 311.0,
 312.0,
 313.0,
 315.0,
 316.0,
 318.0,
 320.0,
 324.0,
 326.0,
 328.0,
 330.0,
 332.0,
 335.0,
 336.0,
 339.0,
 340.0,
 341.0,
 343.0,
 345.0,
 346.0,
 347.0,
 350.0,
 352.0,
 353.0,
 354.0,
 355.0,
 358.0,
 359.0,
 360.0,
 363.0,
 366.0,
 368.0,
 370.0,
 371.0,
 373.0,
 378.0,
 380.0,
 382.0,
 383.0,
 384.0,
 385.0,
 390.0,
 391.0,
 392.0,
 395.0,
 400.0,
 402.0,
 405.0,
 407.0,
 410.0,
 412.0,
 416.0,
 419.0,
 420.0,
 421.0,
 425.0,
 430.0,
 432.0,
 433.0,
 435.0,
 438.0,
 440.0,
 442.

##### FINAL CAPACITY DATASET

In [622]:
df_specs_B_Capacity.head()

| profileId | vehicle | Fuel Tank Capacity | Doors | Seating Capacity | No of Seating Rows | Bootspace |
|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | f64 |
| "D4135089" | "Tata Harrier X…" | "50" | 5.0 | 5.0 | 2.0 | 425.0 |
| "D4170587" | "BMW 3 Series G…" | null | 4.0 | 5.0 | 2.0 | null |
| "D4170559" | "Jeep Compass L…" | "60" | 5.0 | 5.0 | 2.0 | 438.0 |
| "D4112409" | "Mercedes-Benz …" | "66" | 4.0 | 5.0 | 2.0 | null |
| "D4039319" | "Mercedes-Benz …" | "66" | 4.0 | 4.0 | 2.0 | 435.0 |


In [623]:
df_specs_B_Capacity.columns

['profileId',
 'vehicle',
 'Fuel Tank Capacity',
 'Doors',
 'Seating Capacity',
 'No of Seating Rows',
 'Bootspace']

In [625]:
# Append units to the column names; 'Doors' is unitless and keeps its name
df_specs_B_Capacity = df_specs_B_Capacity.rename({
    'Bootspace': 'Bootspace_(litres)',
    'Fuel Tank Capacity': 'Fuel_Tank_Capacity_(litres)',
    'No of Seating Rows': 'Seating_Rows_(rows)',
    'Seating Capacity': 'Seating_Capacity_(persons)',
})

In [626]:
df_specs_B_Capacity.head()

| profileId | vehicle | Fuel_Tank_Capacity_(litres) | Doors | Seating_Capacity_(persons) | Seating_Rows_(rows) | Bootspace_(litres) |
|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | f64 |
| "D4135089" | "Tata Harrier X…" | "50" | 5.0 | 5.0 | 2.0 | 425.0 |
| "D4170587" | "BMW 3 Series G…" | null | 4.0 | 5.0 | 2.0 | null |
| "D4170559" | "Jeep Compass L…" | "60" | 5.0 | 5.0 | 2.0 | 438.0 |
| "D4112409" | "Mercedes-Benz …" | "66" | 4.0 | 5.0 | 2.0 | null |
| "D4039319" | "Mercedes-Benz …" | "66" | 4.0 | 4.0 | 2.0 | 435.0 |


In [627]:
df_specs_B_Capacity.write_csv(r'C:\Users\pryns\OneDrive\Documents\Python Works\zz.carwale data\02.Pre-Processing\Pre-processed Data Stage-II\specs_Capacity_B.csv')