# Using Pandas

In [1]:
import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 200)
# Display the result of every expression in a cell, not only the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<b>Load the data from the vehicles.csv file into a pandas DataFrame.</b>

In [72]:
## Your Code here
df = pd.read_csv("vehicles.csv")
df

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100


First exploration of the dataset:

1. How many observations does it have?
2. Look at all the columns: do you understand what they mean?
3. Look at the raw data: do you see anything weird?
4. Look at the data types: are they the expected ones for the information the column contains?

In [21]:
# 1.
len(df)

35952

In [None]:
# 2. so far so good

In [58]:
# 3. Are there any NaN entries?
df.isnull().values.any()

False

In [57]:
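# 4. Text columns are object, the others are int64 or float64 (the final "dtype: object" describes the dtypes Series itself)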
df.dtypes

Make                        object
Model                       object
Year                         int64
Engine Displacement        float64
Cylinders                  float64
Transmission                object
Drivetrain                  object
Vehicle Class               object
Fuel Type                   object
Fuel Barrels/Year          float64
City MPG                     int64
Highway MPG                  int64
Combined MPG                 int64
CO2 Emission Grams/Mile    float64
Fuel Cost/Year               int64
dtype: object
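
The four exploration questions above can also be answered in one pass. The following is a minimal sketch, assuming a tiny hypothetical `sample` frame in place of the real `vehicles.csv` data; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical three-row sample standing in for vehicles.csv
sample = pd.DataFrame({
    "Make": ["AM General", "ASC Incorporated", "smart"],
    "Year": [1984, 1987, 2016],
    "Cylinders": [4.0, 6.0, 3.0],
})

# Number of observations (rows) and columns
n_rows, n_cols = sample.shape

# Count of missing values in each column
missing_per_column = sample.isnull().sum()

# Split the columns into numeric and non-numeric (text) ones
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
text_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(n_rows, n_cols)
print(missing_per_column)
print(numeric_columns, text_columns)
```

Checking the missing values per column, rather than a single `any()`, shows where a problem sits if one exists.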

### Cleaning and wrangling data

#### 1. Some car brand names refer to the same brand. Replace all brand names that contain the word "Dutton" with simply "Dutton". If you find similar examples, clean their names too. Use `loc` with boolean indexing.

In [10]:
# a. find the rows first
dutton_only = df[df['Make'].str.contains('Dutton')]
print(dutton_only)

         Make               Model  Year  Engine Displacement  Cylinders  \
11012  Dutton       Funeral Coach  1985                  4.1        8.0   
30164  Dutton  Funeral Coach  2WD  1984                  6.0        8.0   
31754  Dutton  Funeral Coach  2WD  1984                  6.0        8.0   

          Transmission         Drivetrain                Vehicle Class  \
11012  Automatic 4-spd  Front-Wheel Drive     Special Purpose Vehicles   
30164  Automatic 3-spd      2-Wheel Drive  Special Purpose Vehicle 2WD   
31754  Automatic 3-spd      2-Wheel Drive  Special Purpose Vehicle 2WD   

      Fuel Type  Fuel Barrels/Year  City MPG  Highway MPG  Combined MPG  \
11012   Regular          19.388824        15           21            17   
30164   Regular          32.961000         9           11            10   
31754   Regular          32.961000        10           11            10   

       CO2 Emission Grams/Mile  Fuel Cost/Year  
11012               522.764706            1950  
301

In [73]:
# b. rename the matching makes to "Dutton"
df['Make'] = df['Make'].replace(['E. P. Dutton, Inc.', 'S and S Coach Company  E.p. Dutton', 'Superior Coaches Div E.p. Dutton'], 'Dutton')
dutton_only = df[df['Make'].str.contains('Dutton')]
print(dutton_only)

         Make               Model  Year  Engine Displacement  Cylinders  \
11012  Dutton       Funeral Coach  1985                  4.1        8.0   
30164  Dutton  Funeral Coach  2WD  1984                  6.0        8.0   
31754  Dutton  Funeral Coach  2WD  1984                  6.0        8.0   

          Transmission         Drivetrain                Vehicle Class  \
11012  Automatic 4-spd  Front-Wheel Drive     Special Purpose Vehicles   
30164  Automatic 3-spd      2-Wheel Drive  Special Purpose Vehicle 2WD   
31754  Automatic 3-spd      2-Wheel Drive  Special Purpose Vehicle 2WD   

      Fuel Type  Fuel Barrels/Year  City MPG  Highway MPG  Combined MPG  \
11012   Regular          19.388824        15           21            17   
30164   Regular          32.961000         9           11            10   
31754   Regular          32.961000        10           11            10   

       CO2 Emission Grams/Mile  Fuel Cost/Year  
11012               522.764706            1950  
301
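
The exercise asks for `loc` with boolean indexing, while the cell above uses `replace` with an explicit list of names. A minimal sketch of the `loc` variant follows, assuming a small hypothetical `sample` frame built from the three names listed above plus one unrelated make.

```python
import pandas as pd

# Hypothetical sample; the long names are the ones replaced in the cell above
sample = pd.DataFrame({
    "Make": [
        "E. P. Dutton, Inc.",
        "S and S Coach Company  E.p. Dutton",
        "Superior Coaches Div E.p. Dutton",
        "Chevrolet",
    ]
})

# Boolean mask: True where the brand name contains the literal text "Dutton"
is_dutton = sample["Make"].str.contains("Dutton", regex=False)

# Assign the clean name only to the rows selected by the mask
sample.loc[is_dutton, "Make"] = "Dutton"

print(sample["Make"].tolist())
```

The mask approach does not require knowing every spelling in advance, which helps when looking for similar cases in other brands.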

#### 2. Convert CO2 Emissions from Grams/Mile to Grams/Km
1 Mile = 1.60934 Km


In [9]:
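# Multiplying and dividing by 1.60934 cancels out, so the values stay in grams/mile; grams/km would be x / 1.60934
df['CO2 Emission Grams/Mile'] = df['CO2 Emission Grams/Mile'].apply(lambda x: (x * 1.60934) / 1.60934)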
df

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,Automatic 3-spd,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,Automatic 3-spd,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,Automatic 4-spd,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,Auto(AM5),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,Auto(AM6),Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100
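
Since 1 mile = 1.60934 km, a value in grams per mile becomes grams per km when divided by 1.60934. A minimal sketch of that conversion, assuming a hypothetical two-row `sample` frame and a hypothetical new column name `CO2 Emission Grams/Km`:

```python
import pandas as pd

KM_PER_MILE = 1.60934

# Hypothetical sample with two emission values taken from the table above
sample = pd.DataFrame({"CO2 Emission Grams/Mile": [522.764706, 244.0]})

# The same grams are spread over 1.60934 km instead of 1 mile, so divide
sample["CO2 Emission Grams/Km"] = sample["CO2 Emission Grams/Mile"] / KM_PER_MILE

print(sample)
```

Writing the result to a new column keeps the original values available and makes the unit visible in the column name.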


#### 3. Create a binary column that solely indicates if the transmission of a car is automatic or manual. Use `pandas.Series.str.startswith` 

In [59]:
# list every distinct value in the column with its count
df['Transmission'].value_counts()

Automatic 4-spd                     10585
Manual 5-spd                         7787
Automatic (S6)                       2631
Automatic 3-spd                      2597
Manual 6-spd                         2423
Automatic 5-spd                      2171
Automatic 6-spd                      1432
Manual 4-spd                         1306
Automatic (S8)                        960
Automatic (S5)                        822
Automatic (variable gear ratios)      675
Automatic 7-spd                       662
Automatic (S7)                        261
Auto(AM-S7)                           256
Automatic 8-spd                       243
Automatic (S4)                        229
Auto(AM7)                             157
Auto(AV-S6)                           145
Auto(AM6)                             110
Auto(AM-S6)                            92
Automatic 9-spd                        90
Manual 3-spd                           74
Manual 7-spd                           68
Auto(AV-S7)                       

In [74]:
# Encode the transmission as binary: 0 = automatic ("Auto..."), 1 = manual ("Manual...")
df['Transmission'] = df['Transmission'].astype(str).apply(
    lambda x: 0 if x.startswith(('A', 'a'))
    else (1 if x.startswith(('M', 'm')) else x)
)

In [75]:
df['Transmission'].value_counts()

0    24290
1    11662
Name: Transmission, dtype: int64
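
The exercise suggests `pandas.Series.str.startswith`, which works on the whole column at once. A minimal sketch follows, assuming a hypothetical `sample` frame and a hypothetical new column name `Is Automatic`, so the original labels are kept rather than overwritten.

```python
import pandas as pd

# Hypothetical sample using labels that appear in the value counts above
sample = pd.DataFrame({
    "Transmission": ["Automatic 4-spd", "Manual 5-spd", "Auto(AM6)", "Automatic (S6)"]
})

# True for labels beginning with "Auto", which covers "Automatic ..." and "Auto(...)"
sample["Is Automatic"] = sample["Transmission"].str.startswith("Auto")

print(sample)
```

Keeping the text column alongside the flag makes it possible to check the encoding later against the original labels.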

#### 4. Convert MPG columns to km_per_liter
1 Gallon = 3.78541 Liters

In [76]:
df

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,0,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,18,17,17,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,0,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,0,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,16,17,16,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,0,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,13,13,13,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,0,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,14,21,16,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,38,36,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,34,39,36,246.000000,1100


In [77]:
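# Multiplying by 3.78541 (liters per gallon) alone does not give km per liter;
# that would be MPG * 1.60934 / 3.78541. The columns also keep their "MPG" names here.
df['Highway MPG'] = df['Highway MPG'].apply(lambda x : x * 3.78541)
df['Combined MPG'] = df['Combined MPG'].apply(lambda x : x * 3.78541)
df['City MPG'] = df['City MPG'].apply(lambda x : x * 3.78541)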

In [78]:
df

Unnamed: 0,Make,Model,Year,Engine Displacement,Cylinders,Transmission,Drivetrain,Vehicle Class,Fuel Type,Fuel Barrels/Year,City MPG,Highway MPG,Combined MPG,CO2 Emission Grams/Mile,Fuel Cost/Year
0,AM General,DJ Po Vehicle 2WD,1984,2.5,4.0,0,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,19.388824,68.13738,64.35197,64.35197,522.764706,1950
1,AM General,FJ8c Post Office,1984,4.2,6.0,0,2-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,49.21033,49.21033,49.21033,683.615385,2550
2,AM General,Post Office DJ5 2WD,1985,2.5,4.0,0,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,20.600625,60.56656,64.35197,60.56656,555.437500,2100
3,AM General,Post Office DJ8 2WD,1985,4.2,6.0,0,Rear-Wheel Drive,Special Purpose Vehicle 2WD,Regular,25.354615,49.21033,49.21033,49.21033,683.615385,2550
4,ASC Incorporated,GNX,1987,3.8,6.0,0,Rear-Wheel Drive,Midsize Cars,Premium,20.600625,52.99574,79.49361,60.56656,555.437500,2550
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35947,smart,fortwo coupe,2013,1.0,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,128.70394,143.84558,136.27476,244.000000,1100
35948,smart,fortwo coupe,2014,1.0,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,128.70394,143.84558,136.27476,243.000000,1100
35949,smart,fortwo coupe,2015,1.0,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,128.70394,143.84558,136.27476,244.000000,1100
35950,smart,fortwo coupe,2016,0.9,3.0,0,Rear-Wheel Drive,Two Seaters,Premium,9.155833,128.70394,147.63099,136.27476,246.000000,1100
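
Converting miles per gallon to kilometers per liter needs both factors given in this lab: miles become kilometers (multiply by 1.60934) and gallons become liters (divide by 3.78541). A minimal sketch, assuming a hypothetical `sample` frame and hypothetical new column names ending in `km_per_liter`:

```python
import pandas as pd

KM_PER_MILE = 1.60934
LITERS_PER_GALLON = 3.78541
MPG_TO_KM_PER_LITER = KM_PER_MILE / LITERS_PER_GALLON  # about 0.425

# Hypothetical sample with MPG values from the first and last rows shown earlier
sample = pd.DataFrame({
    "City MPG": [18, 34],
    "Highway MPG": [17, 38],
    "Combined MPG": [17, 36],
})

mpg_columns = ["City MPG", "Highway MPG", "Combined MPG"]
for column in mpg_columns:
    # e.g. "City MPG" -> "City km_per_liter"
    new_name = column.replace("MPG", "km_per_liter")
    sample[new_name] = sample[column] * MPG_TO_KM_PER_LITER

print(sample)
```

Multiplying the whole column by a constant is vectorized, so `apply` with a lambda is not needed for this step.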


### Gathering insights:

#### 1. How many car makers are there? How many models? Which car maker has the most cars in the dataset?


In [79]:
df.columns

Index(['Make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
       'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
       'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
       'CO2 Emission Grams/Mile', 'Fuel Cost/Year'],
      dtype='object')

In [85]:
df['Make'].nunique() # there are 125 car makers

125

In [86]:
df['Model'].nunique() # 3608 models

3608

In [90]:
df['Make'].mode() # Chevrolet

0    Chevrolet
Name: Make, dtype: object
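
`mode()` names the most frequent maker but not how many cars it has. A minimal sketch that returns both, assuming a hypothetical four-row `sample` frame:

```python
import pandas as pd

# Hypothetical sample; the real dataset has 125 makes
sample = pd.DataFrame({"Make": ["Chevrolet", "Ford", "Chevrolet", "smart"]})

# Number of rows per make
cars_per_make = sample["Make"].value_counts()

top_make = cars_per_make.idxmax()   # label with the highest count
top_count = cars_per_make.max()     # the count itself

print(top_make, top_count)
```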

#### 2. When were these cars made? How big is the engine of these cars?


In [98]:
# A subset of Chevrolets only
df_chevrolet = df[df['Make'] == 'Chevrolet'] 

In [104]:
df['Year'].describe() # between 1984 and 2017

count    35952.00000
mean      2000.71640
std         10.08529
min       1984.00000
25%       1991.00000
50%       2001.00000
75%       2010.00000
max       2017.00000
Name: Year, dtype: float64

In [108]:
# Mean
df_chevrolet['Engine Displacement'].mean() 
# 4.1

# StDev
df_chevrolet['Engine Displacement'].std() 
# 1.48

# Median
df_chevrolet['Engine Displacement'].median() 
# 4.3

4.111501509744611

1.4811026443286053

4.3
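
The three statistics above can be computed in a single call with `Series.agg`. A minimal sketch, assuming a hypothetical `sample` frame with four engine sizes:

```python
import pandas as pd

# Hypothetical engine displacements in liters
sample = pd.DataFrame({"Engine Displacement": [2.5, 4.2, 3.8, 1.0]})

# One Series indexed by statistic name: mean, std, median
summary = sample["Engine Displacement"].agg(["mean", "std", "median"])

print(summary)
```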

#### 3. What's the frequency of different transmissions, drivetrains and fuel types?

In [115]:
df.columns

Index(['Make', 'Model', 'Year', 'Engine Displacement', 'Cylinders',
       'Transmission', 'Drivetrain', 'Vehicle Class', 'Fuel Type',
       'Fuel Barrels/Year', 'City MPG', 'Highway MPG', 'Combined MPG',
       'CO2 Emission Grams/Mile', 'Fuel Cost/Year'],
      dtype='object')

In [114]:
df['Transmission'].value_counts()

0    24290
1    11662
Name: Transmission, dtype: int64

In [116]:
df['Drivetrain'].value_counts()

Front-Wheel Drive             13044
Rear-Wheel Drive              12726
4-Wheel or All-Wheel Drive     6503
All-Wheel Drive                2039
4-Wheel Drive                  1058
2-Wheel Drive                   423
Part-time 4-Wheel Drive         158
2-Wheel Drive, Front              1
Name: Drivetrain, dtype: int64

In [117]:
df['Fuel Type'].value_counts()

Regular                        23587
Premium                         9921
Gasoline or E85                 1195
Diesel                           911
Premium or E85                   121
Midgrade                          74
CNG                               60
Premium and Electricity           20
Gasoline or natural gas           20
Premium Gas or Electricity        17
Regular Gas and Electricity       16
Gasoline or propane                8
Regular Gas or Electricity         2
Name: Fuel Type, dtype: int64

#### 4. What's the car that consumes the least/most fuel?

In [123]:
# Position of the lowest Combined MPG value, i.e. the car that consumes the most fuel
df['Combined MPG'].argmin()
# 20894

df.iloc[20894]
# Lamborghini Countach

20894

Make                            Lamborghini
Model                              Countach
Year                                   1986
Engine Displacement                     5.2
Cylinders                              12.0
Transmission                              1
Drivetrain                 Rear-Wheel Drive
Vehicle Class                   Two Seaters
Fuel Type                           Premium
Fuel Barrels/Year                 47.087143
City MPG                           22.71246
Highway MPG                         37.8541
Combined MPG                       26.49787
CO2 Emission Grams/Mile         1269.571429
Fuel Cost/Year                         5800
Name: 20894, dtype: object

In [127]:
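# Position of the highest Combined MPG value, i.e. the car that consumes the least fuel
df['Combined MPG'].argmax()
df.iloc[33279]
# Toyota Prius Eco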

33279

Make                                  Toyota
Model                              Prius Eco
Year                                    2016
Engine Displacement                      1.8
Cylinders                                4.0
Transmission                               0
Drivetrain                 Front-Wheel Drive
Vehicle Class                   Midsize Cars
Fuel Type                            Regular
Fuel Barrels/Year                   5.885893
City MPG                           219.55378
Highway MPG                        200.62673
Combined MPG                       211.98296
CO2 Emission Grams/Mile                158.0
Fuel Cost/Year                           600
Name: 33279, dtype: object
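
The lookups above copy the position returned by `argmin`/`argmax` into `iloc` by hand. A minimal sketch that avoids the hard-coded numbers with `idxmin`/`idxmax`, assuming a hypothetical three-row `sample` frame with illustrative MPG values:

```python
import pandas as pd

# Hypothetical sample; the MPG figures are illustrative only
sample = pd.DataFrame({
    "Make": ["Lamborghini", "Toyota", "Chevrolet"],
    "Model": ["Countach", "Prius Eco", "Malibu"],
    "Combined MPG": [7, 56, 25],
})

# idxmin/idxmax return index labels, so they pair with loc
most_fuel = sample.loc[sample["Combined MPG"].idxmin()]   # lowest MPG
least_fuel = sample.loc[sample["Combined MPG"].idxmax()]  # highest MPG

print(most_fuel["Model"], least_fuel["Model"])
```

Low MPG means high consumption, so the minimum of the column identifies the car that uses the most fuel.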

<b>(Optional)</b>

Which brand has the worst CO2 emissions on average?

Hint: use the function `sort_values()`

In [28]:
## your Code here
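
One possible approach, following the `sort_values()` hint: group by make, average the emissions, and sort in descending order. This is a minimal sketch on a hypothetical `sample` frame with made-up brands and values, not a result for the real dataset.

```python
import pandas as pd

# Hypothetical brands and emission values, for illustration only
sample = pd.DataFrame({
    "Make": ["Brand A", "Brand A", "Brand B", "Brand C"],
    "CO2 Emission Grams/Mile": [500.0, 700.0, 250.0, 650.0],
})

# Mean emissions per make, highest first
average_co2 = (
    sample.groupby("Make")["CO2 Emission Grams/Mile"]
    .mean()
    .sort_values(ascending=False)
)

worst_make = average_co2.index[0]

print(average_co2)
print(worst_make)
```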


Do cars with automatic transmission consume more fuel than cars with manual transmission on average?

In [20]:
## Your Code is here 
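
One way to compare the two groups is to average a consumption measure per transmission code. This is a minimal sketch, assuming the 0 = automatic, 1 = manual encoding used earlier and a hypothetical `sample` frame; the numbers are made up and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical values; 0 = automatic, 1 = manual as encoded earlier
sample = pd.DataFrame({
    "Transmission": [0, 0, 1, 1],
    "Fuel Barrels/Year": [20.0, 16.0, 15.0, 13.0],
})

# Mean yearly fuel consumption for each transmission code
mean_by_transmission = sample.groupby("Transmission")["Fuel Barrels/Year"].mean()

automatic_uses_more = mean_by_transmission.loc[0] > mean_by_transmission.loc[1]

print(mean_by_transmission)
print(automatic_uses_more)
```

`Fuel Barrels/Year` measures consumption directly, so a higher mean means more fuel used; with an MPG column the comparison would be reversed.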
