## INTRODUCTION
In this analysis, I aim to determine the relationship between vehicle features (such as engine size, fuel type, transmission system, and number of cylinders) and fuel efficiency, examining how these characteristics influence fuel consumption and emissions. By assessing the dataset, I intend to provide actionable, data-driven insights that guide the company's decision-making in line with fuel efficiency and market trends.

## PROBLEM STATEMENT
My company has observed that leading automotive manufacturers are producing new, highly fuel-efficient vehicles, and it is interested in joining this competitive market. It plans to establish a vehicle manufacturing plant but lacks in-depth knowledge of vehicle design and performance characteristics. I have been tasked with analyzing current market trends to identify the types of vehicles that excel in fuel efficiency. Using the fuel consumption dataset, I will explore various vehicle features to determine which configurations achieve the best fuel consumption and emissions performance. These findings will provide actionable insights to guide the company's decision on which type of vehicle to manufacture, in line with current fuel efficiency standards and market demand.

## OBJECTIVES
1. Perform EDA to identify the relationships between vehicle features and fuel efficiency.
2. Identify the optimal vehicle features for fuel efficiency:
   - Best fuel efficiency
   - Lowest emissions
3. Conduct statistical hypothesis tests to identify statistically significant relationships between vehicle characteristics and fuel efficiency.
4. Build a predictive OLS regression model to analyze the effect of various vehicle attributes on fuel consumption and emissions.

## DATA LOADING AND DESCRIPTION

TRANSMISSION

A - Automatic

AM - Automated Manual

AS - Automatic with Select Shift

AV - Continuously Variable

M - Manual 

Number (3-10) - Number of gears

FUEL TYPE 

X - Regular gasoline

Z - Premium gasoline

E - Ethanol

D - Diesel

N - Natural Gas
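
The transmission legend above implies that a code such as `AS8` combines a type prefix with an optional gear count. The following is a minimal sketch of how such codes could be split, assuming every code is a run of letters followed by optional digits; the names `TRANSMISSION_TYPES` and `parse_transmission` are hypothetical helpers introduced here for illustration and are not part of the original notebook.

```python
import re

# Type prefixes taken from the legend above (dictionary name is illustrative).
TRANSMISSION_TYPES = {
    "A": "Automatic",
    "AM": "Automated Manual",
    "AS": "Automatic with Select Shift",
    "AV": "Continuously Variable",
    "M": "Manual",
}


def parse_transmission(code):
    """Split a code such as 'AS8' into (type label, number of gears or None)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)?", code.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized transmission code: {code!r}")
    prefix, gears = match.groups()
    # Prefixes missing from the legend are reported as "Unknown" rather than guessed.
    label = TRANSMISSION_TYPES.get(prefix, "Unknown")
    return label, (int(gears) if gears else None)


print(parse_transmission("AS8"))
print(parse_transmission("M5"))
```

Applied to the dataset, something like `df["TRANSMISSION"].map(parse_transmission)` would yield one tuple per row, which could then be expanded into separate type and gear columns.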


In [20]:
#Importing the necessary libraries
import pandas as pd #Pandas Library
import numpy as np #Numpy Library
import matplotlib.pyplot as plt #Matplotlib library
import seaborn as sns #Seaborn library
from scipy import stats #scipy library



In [21]:
df = pd.read_csv("../Data/Fuel_Consumption_2000_2022.csv")
df

Unnamed: 0,YEAR,MAKE,MODEL,VEHICLE CLASS,ENGINE SIZE,CYLINDERS,TRANSMISSION,FUEL,FUEL CONSUMPTION,HWY (L/100 km),COMB (L/100 km),COMB (mpg),EMISSIONS
0,2000,ACURA,1.6EL,COMPACT,1.6,4,A4,X,9.2,6.7,8.1,35,186
1,2000,ACURA,1.6EL,COMPACT,1.6,4,M5,X,8.5,6.5,7.6,37,175
2,2000,ACURA,3.2TL,MID-SIZE,3.2,6,AS5,Z,12.2,7.4,10.0,28,230
3,2000,ACURA,3.5RL,MID-SIZE,3.5,6,A4,Z,13.4,9.2,11.5,25,264
4,2000,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.0,7.0,8.6,33,198
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22551,2022,Volvo,XC40 T5 AWD,SUV: Small,2.0,4,AS8,Z,10.7,7.7,9.4,30,219
22552,2022,Volvo,XC60 B5 AWD,SUV: Small,2.0,4,AS8,Z,10.5,8.1,9.4,30,219
22553,2022,Volvo,XC60 B6 AWD,SUV: Small,2.0,4,AS8,Z,11.0,8.7,9.9,29,232
22554,2022,Volvo,XC90 T5 AWD,SUV: Standard,2.0,4,AS8,Z,11.5,8.4,10.1,28,236


In [41]:
#Copy the original Dataset
data = df.copy()

In [22]:
df.head()

Unnamed: 0,YEAR,MAKE,MODEL,VEHICLE CLASS,ENGINE SIZE,CYLINDERS,TRANSMISSION,FUEL,FUEL CONSUMPTION,HWY (L/100 km),COMB (L/100 km),COMB (mpg),EMISSIONS
0,2000,ACURA,1.6EL,COMPACT,1.6,4,A4,X,9.2,6.7,8.1,35,186
1,2000,ACURA,1.6EL,COMPACT,1.6,4,M5,X,8.5,6.5,7.6,37,175
2,2000,ACURA,3.2TL,MID-SIZE,3.2,6,AS5,Z,12.2,7.4,10.0,28,230
3,2000,ACURA,3.5RL,MID-SIZE,3.5,6,A4,Z,13.4,9.2,11.5,25,264
4,2000,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.0,7.0,8.6,33,198


In [42]:
#Checking rows and columns
df.shape


(22556, 13)

Our dataset has 22,556 rows and 13 columns.

In [43]:
#Summary statistics for numerical columns
df.describe()

Unnamed: 0,YEAR,ENGINE SIZE,CYLINDERS,FUEL CONSUMPTION,HWY (L/100 km),COMB (L/100 km),COMB (mpg),EMISSIONS
count,22556.0,22556.0,22556.0,22556.0,22556.0,22556.0,22556.0,22556.0
mean,2011.554442,3.356646,5.854141,12.763513,8.919126,11.034341,27.374534,250.068452
std,6.298269,1.335425,1.819597,3.500999,2.274764,2.91092,7.376982,59.355276
min,2000.0,0.8,2.0,3.5,3.2,3.6,11.0,83.0
25%,2006.0,2.3,4.0,10.4,7.3,9.1,22.0,209.0
50%,2012.0,3.0,6.0,12.3,8.4,10.6,27.0,243.0
75%,2017.0,4.2,8.0,14.725,10.2,12.7,31.0,288.0
max,2022.0,8.4,16.0,30.6,20.9,26.1,78.0,608.0
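
The summary reports combined consumption in two units, `COMB (L/100 km)` and `COMB (mpg)`. The sketch below shows the unit conversion, assuming the mpg column uses imperial gallons; the sample rows shown earlier (for example 8.1 L/100 km alongside 35 mpg) appear consistent with that assumption, but the source does not state it. The function name is hypothetical.

```python
LITRES_PER_IMPERIAL_GALLON = 4.54609
KM_PER_MILE = 1.609344


def l_per_100km_to_mpg_imperial(l_per_100km):
    """Convert litres per 100 km to miles per imperial gallon."""
    miles = 100 / KM_PER_MILE                      # distance covered, in miles
    gallons = l_per_100km / LITRES_PER_IMPERIAL_GALLON  # fuel used, in gallons
    return miles / gallons


# Compare against the first rows of the dataset shown above.
for litres in (8.1, 7.6, 10.0):
    print(litres, round(l_per_100km_to_mpg_imperial(litres)))
```

Because the two columns are a deterministic transformation of each other, only one of them would normally be kept as a regression target.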


In [44]:
#Checking our columns by name
df.columns

Index(['YEAR', 'MAKE', 'MODEL', 'VEHICLE CLASS', 'ENGINE SIZE', 'CYLINDERS',
       'TRANSMISSION', 'FUEL', 'FUEL CONSUMPTION', 'HWY (L/100 km)',
       'COMB (L/100 km)', 'COMB (mpg)', 'EMISSIONS'],
      dtype='object')

## EXPLORATORY DATA ANALYSIS (EDA)

In [46]:
#Data Understanding
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22556 entries, 0 to 22555
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   YEAR              22556 non-null  int64  
 1   MAKE              22556 non-null  object 
 2   MODEL             22556 non-null  object 
 3   VEHICLE CLASS     22556 non-null  object 
 4   ENGINE SIZE       22556 non-null  float64
 5   CYLINDERS         22556 non-null  int64  
 6   TRANSMISSION      22556 non-null  object 
 7   FUEL              22556 non-null  object 
 8   FUEL CONSUMPTION  22556 non-null  float64
 9   HWY (L/100 km)    22556 non-null  float64
 10  COMB (L/100 km)   22556 non-null  float64
 11  COMB (mpg)        22556 non-null  int64  
 12  EMISSIONS         22556 non-null  int64  
dtypes: float64(4), int64(4), object(5)
memory usage: 2.2+ MB


In [47]:
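#Converting YEAR from int64 to datetime; format='%Y' makes 2000 parse as a year
#instead of as 2000 nanoseconds after 1970-01-01
df['YEAR'] = pd.to_datetime(df['YEAR'].astype(str), format='%Y')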

In [48]:
#Data Understanding
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22556 entries, 0 to 22555
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   YEAR              22556 non-null  datetime64[ns]
 1   MAKE              22556 non-null  object        
 2   MODEL             22556 non-null  object        
 3   VEHICLE CLASS     22556 non-null  object        
 4   ENGINE SIZE       22556 non-null  float64       
 5   CYLINDERS         22556 non-null  int64         
 6   TRANSMISSION      22556 non-null  object        
 7   FUEL              22556 non-null  object        
 8   FUEL CONSUMPTION  22556 non-null  float64       
 9   HWY (L/100 km)    22556 non-null  float64       
 10  COMB (L/100 km)   22556 non-null  float64       
 11  COMB (mpg)        22556 non-null  int64         
 12  EMISSIONS         22556 non-null  int64         
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.2+ MB


In [49]:
#Checking Null values from our Dataset
df.isna().sum()

YEAR                0
MAKE                0
MODEL               0
VEHICLE CLASS       0
ENGINE SIZE         0
CYLINDERS           0
TRANSMISSION        0
FUEL                0
FUEL CONSUMPTION    0
HWY (L/100 km)      0
COMB (L/100 km)     0
COMB (mpg)          0
EMISSIONS           0
dtype: int64

There are no null values in our dataset.

In [50]:
#Checking duplicates
df.duplicated().sum()

1

In [52]:
#Verifying the duplicate
df[df.duplicated(keep=False)]

Unnamed: 0,YEAR,MAKE,MODEL,VEHICLE CLASS,ENGINE SIZE,CYLINDERS,TRANSMISSION,FUEL,FUEL CONSUMPTION,HWY (L/100 km),COMB (L/100 km),COMB (mpg),EMISSIONS
377,1970-01-01 00:00:00.000002,LAND ROVER,DISCOVERY SERIES II 4X4,SUV,4.0,8,A4,Z,17.7,12.7,15.4,18,354
378,1970-01-01 00:00:00.000002,LAND ROVER,DISCOVERY SERIES II 4X4,SUV,4.0,8,A4,Z,17.7,12.7,15.4,18,354
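
The `1970-01-01 00:00:00.000002` timestamps in the output above are what `pd.to_datetime` produces when integer years are passed without a format: the value 2000 is read as 2000 nanoseconds after the Unix epoch, so both rows are really model year 2000. A small sketch of the difference, using a made-up two-element series rather than the real column:

```python
import pandas as pd

# Illustrative stand-in for the YEAR column (not the real data).
years = pd.Series([2000, 2022], name="YEAR")

# Without a format, integers are treated as nanoseconds since 1970-01-01.
as_nanoseconds = pd.to_datetime(years)

# With an explicit format, the values are parsed as calendar years.
as_years = pd.to_datetime(years.astype(str), format="%Y")

print(as_nanoseconds.dt.year.tolist())
print(as_years.dt.year.tolist())
```

If only the year is needed for grouping or plotting, keeping `YEAR` as an integer column is also a reasonable choice.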


In [32]:
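#Dropping the duplicate row; assign the result back, since drop_duplicates returns a new DataFrame
df = df.drop_duplicates()
df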

Unnamed: 0,YEAR,MAKE,MODEL,VEHICLE CLASS,ENGINE SIZE,CYLINDERS,TRANSMISSION,FUEL,FUEL CONSUMPTION,HWY (L/100 km),COMB (L/100 km),COMB (mpg),EMISSIONS
0,2000,ACURA,1.6EL,COMPACT,1.6,4,A4,X,9.2,6.7,8.1,35,186
1,2000,ACURA,1.6EL,COMPACT,1.6,4,M5,X,8.5,6.5,7.6,37,175
2,2000,ACURA,3.2TL,MID-SIZE,3.2,6,AS5,Z,12.2,7.4,10.0,28,230
3,2000,ACURA,3.5RL,MID-SIZE,3.5,6,A4,Z,13.4,9.2,11.5,25,264
4,2000,ACURA,INTEGRA,SUBCOMPACT,1.8,4,A4,X,10.0,7.0,8.6,33,198
...,...,...,...,...,...,...,...,...,...,...,...,...,...
22551,2022,Volvo,XC40 T5 AWD,SUV: Small,2.0,4,AS8,Z,10.7,7.7,9.4,30,219
22552,2022,Volvo,XC60 B5 AWD,SUV: Small,2.0,4,AS8,Z,10.5,8.1,9.4,30,219
22553,2022,Volvo,XC60 B6 AWD,SUV: Small,2.0,4,AS8,Z,11.0,8.7,9.9,29,232
22554,2022,Volvo,XC90 T5 AWD,SUV: Standard,2.0,4,AS8,Z,11.5,8.4,10.1,28,236


In [33]:
#Counting vehicles per class
df['VEHICLE CLASS'].value_counts()

SUV                         2640
COMPACT                     2636
MID-SIZE                    2300
PICKUP TRUCK - STANDARD     1689
SUBCOMPACT                  1559
FULL-SIZE                   1086
TWO-SEATER                   999
SUV: Small                   929
SUV - SMALL                  827
MINICOMPACT                  783
STATION WAGON - SMALL        737
Mid-size                     660
SUV: Standard                608
Pickup truck: Standard       515
SUV - STANDARD               514
Compact                      491
Subcompact                   451
Full-size                    417
PICKUP TRUCK - SMALL         403
MINIVAN                      366
STATION WAGON - MID-SIZE     343
VAN - CARGO                  332
Two-seater                   313
VAN - PASSENGER              287
Minicompact                  211
Station wagon: Small         140
Pickup truck: Small          108
Special purpose vehicle       62
SPECIAL PURPOSE VEHICLE       52
Station wagon: Mid-size       44
Minivan   
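
The counts above show the same class under two spellings, for example `SUV: Small` and `SUV - SMALL`, or `Mid-size` and `MID-SIZE`. One possible way to merge them is sketched below, assuming the only differences are letter case and `:` versus ` - ` as the separator; the helper name `normalize_vehicle_class` is hypothetical.

```python
import pandas as pd


def normalize_vehicle_class(labels):
    """Upper-case the labels and rewrite 'X: Y' as 'X - Y'."""
    return (
        labels.str.upper()
        .str.replace(r"\s*:\s*", " - ", regex=True)
        .str.strip()
    )


# Small illustrative sample using labels that appear in the output above.
sample = pd.Series(["SUV: Small", "SUV - SMALL", "Mid-size", "MID-SIZE"])
print(normalize_vehicle_class(sample).unique())
```

Running `df['VEHICLE CLASS'] = normalize_vehicle_class(df['VEHICLE CLASS'])` before `value_counts()` would then report each class once; any spelling variants beyond these two patterns would still need to be checked by hand.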

In [34]:
df["FUEL"].value_counts()

X    11822
Z     9316
E     1071
D      314
N       33
Name: FUEL, dtype: int64
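
To make the fuel counts easier to read, the single-letter codes can be replaced with the labels from the FUEL TYPE legend. The sketch below rebuilds the counts shown above by hand and computes each fuel's share; the names `FUEL_LABELS` and `fuel_counts` are introduced here for illustration.

```python
import pandas as pd

# Labels from the FUEL TYPE legend in the data description.
FUEL_LABELS = {
    "X": "Regular gasoline",
    "Z": "Premium gasoline",
    "E": "Ethanol",
    "D": "Diesel",
    "N": "Natural Gas",
}

# Counts copied from the value_counts() output above.
fuel_counts = pd.Series({"X": 11822, "Z": 9316, "E": 1071, "D": 314, "N": 33})

labelled = fuel_counts.rename(index=FUEL_LABELS)
share_percent = (labelled / labelled.sum() * 100).round(2)

print(labelled.sum())
print(share_percent)
```

On the real data, `df["FUEL"].map(FUEL_LABELS).value_counts(normalize=True)` would give the same shares as fractions directly.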