# Exploratory Data Analysis

In this notebook I carry out a quick EDA of the Getaround pricing data to check:
- Missing values
- Types and distributions of the target and explanatory variables
- Relationships between the target and explanatory variables
- Outlier samples

## Import libraries

In [1]:
import pandas as pd

## Import data

In [2]:
df = pd.read_csv('../data/get_around_pricing_project_raw.csv').iloc[:, 1:]
print('Shape of pricing data: There are {} samples and {} variables'.format(df.shape[0], df.shape[1]))
df.head()


Shape of pricing data: There are 4843 samples and 14 variables


| | model_key | mileage | engine_power | fuel | paint_color | car_type | private_parking_available | has_gps | has_air_conditioning | automatic_car | has_getaround_connect | has_speed_regulator | winter_tires | rental_price_per_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Citroën | 140411 | 100 | diesel | black | convertible | True | True | False | False | True | True | True | 106 |
| 1 | Citroën | 13929 | 317 | petrol | grey | convertible | True | True | False | False | False | True | True | 264 |
| 2 | Citroën | 183297 | 120 | diesel | white | convertible | False | False | False | False | True | False | True | 101 |
| 3 | Citroën | 128035 | 135 | diesel | red | convertible | True | True | False | False | True | True | True | 158 |
| 4 | Citroën | 97097 | 160 | diesel | silver | convertible | True | True | False | False | False | True | True | 183 |


**Target variable**: rental price per day (`rental_price_per_day`)

## 1. Missing values

In [3]:
# Check data types and missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4843 entries, 0 to 4842
Data columns (total 14 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   model_key                  4843 non-null   object
 1   mileage                    4843 non-null   int64 
 2   engine_power               4843 non-null   int64 
 3   fuel                       4843 non-null   object
 4   paint_color                4843 non-null   object
 5   car_type                   4843 non-null   object
 6   private_parking_available  4843 non-null   bool  
 7   has_gps                    4843 non-null   bool  
 8   has_air_conditioning       4843 non-null   bool  
 9   automatic_car              4843 non-null   bool  
 10  has_getaround_connect      4843 non-null   bool  
 11  has_speed_regulator        4843 non-null   bool  
 12  winter_tires               4843 non-null   bool  
 13  rental_price_per_day       4843 non-null   int64 
dtypes: bool(

**There are no missing values**
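The `df.info()` output reports 4843 non-null entries for every column. A more direct check counts missing values per column. The sketch below uses a small made-up frame (not the real pricing data) so that it runs on its own; the name `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy frame standing in for the pricing data (values are made up,
# with two gaps added on purpose so the check has something to find).
toy = pd.DataFrame({
    "mileage": [140411, None, 183297],
    "fuel": ["diesel", "petrol", None],
    "rental_price_per_day": [106, 264, 101],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

On the real dataframe, `df.isna().sum()` should return zero for every column if the conclusion above holds.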

## 2. Variables distribution

In [4]:
# Describe variables
df.describe(include='all')

| | model_key | mileage | engine_power | fuel | paint_color | car_type | private_parking_available | has_gps | has_air_conditioning | automatic_car | has_getaround_connect | has_speed_regulator | winter_tires | rental_price_per_day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4843 | 4843.0 | 4843.0 | 4843 | 4843 | 4843 | 4843 | 4843 | 4843 | 4843 | 4843 | 4843 | 4843 | 4843.0 |
| unique | 28 | | | 4 | 10 | 8 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | |
| top | Citroën | | | diesel | black | estate | True | True | False | False | False | False | True | |
| freq | 969 | | | 4641 | 1633 | 1606 | 2662 | 3839 | 3865 | 3881 | 2613 | 3674 | 4514 | |
| mean | | 140962.8 | 128.98823 | | | | | | | | | | | 121.214536 |
| std | | 60196.74 | 38.99336 | | | | | | | | | | | 33.568268 |
| min | | -64.0 | 0.0 | | | | | | | | | | | 10.0 |
| 25% | | 102913.5 | 100.0 | | | | | | | | | | | 104.0 |
| 50% | | 141080.0 | 120.0 | | | | | | | | | | | 119.0 |
| 75% | | 175195.5 | 135.0 | | | | | | | | | | | 136.0 |
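The `min` row shows a mileage of -64 and an engine power of 0, both physically implausible. One way to spot such values systematically is a range check per column. The sketch below is illustrative only: the `toy` frame is made up, and the lower bounds in `lower_bounds` are assumptions, not rules taken from the dataset.

```python
import pandas as pd

# Made-up rows mimicking the suspicious minima seen in describe().
toy = pd.DataFrame({
    "mileage": [-64, 141080, 175195],
    "engine_power": [0, 120, 135],
    "rental_price_per_day": [10, 119, 136],
})

# Hypothetical lower bounds: a value below the bound is flagged as implausible.
lower_bounds = {"mileage": 0, "engine_power": 1, "rental_price_per_day": 1}

# Number of rows violating the bound, per column.
violations = {
    col: int((toy[col] < bound).sum())
    for col, bound in lower_bounds.items()
}
print(violations)
```

Whether a flagged row should be dropped or corrected is a separate decision; this notebook only removes the negative mileage.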


In [5]:
# Mileage outlier: there is a negative mileage value

df.mileage.sort_values()

2938        -64
2409        476
4372        612
3935        706
3687       2399
         ...   
3198     405816
2829     439060
2350     477571
557      484615
3732    1000376
Name: mileage, Length: 4843, dtype: int64

In [6]:
# The minimum of the target variable is very low (10 euros per day), which looks suspicious.
# Several other rows share this low price, so these samples are kept.

df.rental_price_per_day.sort_values()

565      10
630      10
879      10
2829     10
1832     10
       ... 
2938    274
4146    287
90      309
4684    378
4753    422
Name: rental_price_per_day, Length: 4843, dtype: int64
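Rather than reading the sorted output by eye, the number of low-priced rows can be counted directly. This is a minimal sketch on made-up prices; the cutoff `LOW_PRICE_THRESHOLD` is a hypothetical value chosen for illustration, not one used in the notebook.

```python
import pandas as pd

# Made-up prices: a few at the 10 euro floor, the rest in a typical range.
prices = pd.Series([10, 10, 10, 104, 119, 136, 422], name="rental_price_per_day")

# Hypothetical cutoff below which a daily price is considered "low".
LOW_PRICE_THRESHOLD = 20

# Number of rows at or below the cutoff.
n_low = int((prices <= LOW_PRICE_THRESHOLD).sum())

# Frequency of the lowest distinct prices, smallest first.
lowest_counts = prices.value_counts().sort_index().head(3)
print(n_low)
print(lowest_counts)
```

If many rows share the minimum price, it is more likely a floor or a data-entry convention than a single faulty record, which supports keeping them.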

In [7]:
# Inspect the car with the lowest rental price per day
df.iloc[565,:]

model_key                    Citroën
mileage                       179358
engine_power                     120
fuel                          diesel
paint_color                    black
car_type                      estate
private_parking_available      False
has_gps                         True
has_air_conditioning           False
automatic_car                  False
has_getaround_connect          False
has_speed_regulator            False
winter_tires                    True
rental_price_per_day              10
Name: 565, dtype: object

In [8]:
# Compare with the car with the highest rental price per day
df.iloc[4753,:]

model_key                       BMW
mileage                       72515
engine_power                    135
fuel                         diesel
paint_color                    blue
car_type                        suv
private_parking_available     False
has_gps                       False
has_air_conditioning           True
automatic_car                 False
has_getaround_connect         False
has_speed_regulator           False
winter_tires                  False
rental_price_per_day            422
Name: 4753, dtype: object

## 3. Variable relationships

In [9]:
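# Pearson correlation on numeric and boolean columns only
# (recent pandas versions raise an error if string columns are passed to corr())
corrMatrix = df.select_dtypes(include=['number', 'bool']).corr()
display(corrMatrix)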

| | mileage | engine_power | private_parking_available | has_gps | has_air_conditioning | automatic_car | has_getaround_connect | has_speed_regulator | winter_tires | rental_price_per_day |
|---|---|---|---|---|---|---|---|---|---|---|
| mileage | 1.0 | -0.050116 | 0.065258 | 0.009695 | 0.003621 | -0.052857 | 0.046706 | -0.029626 | 0.154827 | -0.448912 |
| engine_power | -0.050116 | 1.0 | 0.327213 | 0.201202 | 0.312789 | 0.447769 | 0.341004 | 0.232058 | 0.008905 | 0.625645 |
| private_parking_available | 0.065258 | 0.327213 | 1.0 | 0.305965 | 0.254764 | 0.230125 | 0.27832 | 0.134274 | 0.243831 | 0.281358 |
| has_gps | 0.009695 | 0.201202 | 0.305965 | 1.0 | 0.150669 | 0.149922 | 0.285422 | 0.136106 | 0.370019 | 0.310889 |
| has_air_conditioning | 0.003621 | 0.312789 | 0.254764 | 0.150669 | 1.0 | 0.199477 | 0.198823 | 0.144153 | 0.062218 | 0.245386 |
| automatic_car | -0.052857 | 0.447769 | 0.230125 | 0.149922 | 0.199477 | 1.0 | 0.250277 | 0.153347 | 0.126184 | 0.419761 |
| has_getaround_connect | 0.046706 | 0.341004 | 0.27832 | 0.285422 | 0.198823 | 0.250277 | 1.0 | 0.257245 | 0.203305 | 0.318486 |
| has_speed_regulator | -0.029626 | 0.232058 | 0.134274 | 0.136106 | 0.144153 | 0.153347 | 0.257245 | 1.0 | 0.129273 | 0.227547 |
| winter_tires | 0.154827 | 0.008905 | 0.243831 | 0.370019 | 0.062218 | 0.126184 | 0.203305 | 0.129273 | 1.0 | 0.018277 |
| rental_price_per_day | -0.448912 | 0.625645 | 0.281358 | 0.310889 | 0.245386 | 0.419761 | 0.318486 | 0.227547 | 0.018277 | 1.0 |
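The last row of the matrix is the most useful one for modelling, since it relates each variable to the target. A possible way to extract and rank it is sketched below. The frame `toy` reuses the five rows printed by `df.head()` for a few columns, so the resulting coefficients will differ from those of the full dataset; the names `TARGET` and `ranked` are assumptions.

```python
import pandas as pd

# Five rows copied from the df.head() output (subset of columns).
toy = pd.DataFrame({
    "mileage": [140411, 13929, 183297, 128035, 97097],
    "engine_power": [100, 317, 120, 135, 160],
    "has_gps": [True, True, False, True, True],
    "fuel": ["diesel", "petrol", "diesel", "diesel", "diesel"],
    "rental_price_per_day": [106, 264, 101, 158, 183],
})

TARGET = "rental_price_per_day"

# Keep numeric and boolean columns, cast to float so corr() treats them alike.
numeric = toy.select_dtypes(include=["number", "bool"]).astype(float)

# Correlation of every remaining variable with the target, target itself removed.
corr_with_target = numeric.corr()[TARGET].drop(TARGET)

# Rank by absolute strength so strong negative correlations are not buried.
ranked = corr_with_target.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Sorting by absolute value matters here because mileage is negatively correlated with price and would otherwise end up at the bottom of the list.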


## 4. Dataset summary  

- There are **4843 samples** and **14 variables**
- The **target variable** is the **rental price per day**
- There are **no missing values**
- There is **one outlier sample** with a **negative mileage**
- Variable types:
  - **Categorical** (11):
    - Boolean (7): *private_parking_available, has_gps, has_air_conditioning, automatic_car, has_getaround_connect, has_speed_regulator, winter_tires*
    - Str (4): *model_key, fuel, paint_color, car_type*
  - **Numerical** (3): *mileage, engine_power, rental_price_per_day* (the target)
- Variable relationships:
  - **No strong collinearity** between explanatory variables (the highest pairwise Pearson correlation is 0.45, between *engine_power* and *automatic_car*)
  - A **moderate** positive **correlation** between the **target** variable and **engine_power** (Pearson corr of 0.63)
  - A **moderate** negative **correlation** between the **target** variable and **mileage** (Pearson corr of -0.45)
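The collinearity statement can be checked mechanically by listing every pair of explanatory variables whose absolute correlation exceeds a cutoff. The sketch below uses three coefficients copied from the correlation matrix above; the cutoff `COLLINEARITY_THRESHOLD` of 0.8 is an assumed, commonly used rule of thumb, not a value from this notebook.

```python
import numpy as np
import pandas as pd

# Sub-matrix of the correlation table above (three explanatory variables).
cols = ["mileage", "engine_power", "automatic_car"]
corr = pd.DataFrame(
    [
        [1.0, -0.050116, -0.052857],
        [-0.050116, 1.0, 0.447769],
        [-0.052857, 0.447769, 1.0],
    ],
    index=cols,
    columns=cols,
)

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Hypothetical cutoff for calling two variables collinear.
COLLINEARITY_THRESHOLD = 0.8
collinear = pairs[pairs.abs() > COLLINEARITY_THRESHOLD]

# Pair with the strongest absolute correlation.
strongest = pairs.abs().idxmax()
print(collinear)
print(strongest)
```

With these values no pair crosses the cutoff, and the strongest pair is *engine_power* with *automatic_car*.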

In [11]:
# Export clean dataframe without outlier
df_clean = df.drop(2938)
print(df.shape)
print(df_clean.shape)

df_clean.to_csv('../data/get_around_pricing_project_clean.csv', index=False)

(4843, 14)
(4842, 14)
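Dropping by the hard-coded label 2938 works for this file but would silently remove the wrong row if the raw data were reordered. An alternative, sketched here on a made-up frame, is to filter on the condition itself. The frame `toy` and the name `toy_clean` are assumptions for illustration.

```python
import pandas as pd

# Made-up rows: the middle one mimics the negative-mileage outlier.
toy = pd.DataFrame({
    "mileage": [140411, -64, 183297],
    "rental_price_per_day": [106, 274, 101],
})

# Keep only rows with a non-negative mileage, then renumber the index.
toy_clean = toy[toy["mileage"] >= 0].reset_index(drop=True)

print(toy.shape)
print(toy_clean.shape)
```

Applied to the real dataframe, the same filter should remove exactly one row, matching the shapes printed above.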
