# Fixing `cyl` Data Type
- 2008: extract int from string
- 2018: convert float to int

Load datasets `data_08_v2.csv` and `data_18_v2.csv`. You should've created these data files in the previous section: *Filter, Drop Nulls, Dedupe*.

In [1]:
import pandas as pd

In [2]:
# dataset url
url_08 = 'https://raw.githubusercontent.com/bentegviz/udacity_intro_to_data_analysis/main/Case%20Study%202/data/processed/all_alpha_08.csv'
url_18 = 'https://raw.githubusercontent.com/bentegviz/udacity_intro_to_data_analysis/main/Case%20Study%202/data/processed/all_alpha_18.csv'

In [3]:
# csv to dataset
df_08 = pd.read_csv(url_08)
df_18 = pd.read_csv(url_18)
print('Load CSV Complete')

Load CSV Complete


In [5]:
# check value counts for the 2008 cyl column
df_08['Cyl'].value_counts()

(6 cyl)     864
(4 cyl)     600
(8 cyl)     533
(5 cyl)     113
(12 cyl)     60
(10 cyl)     29
(2 cyl)       4
(16 cyl)      2
Name: Cyl, dtype: int64

In [7]:
# view missing value count for each feature in 2008
df_08.isnull().sum()

Model                     0
Displ                     0
Cyl                     199
Trans                   199
Drive                    93
Fuel                      0
Sales Area                0
Stnd                      0
Underhood ID              0
Veh Class                 0
Air Pollution Score       0
FE Calc Appr            199
City MPG                199
Hwy MPG                 199
Cmb MPG                 199
Unadj Cmb MPG           199
Greenhouse Gas Score    199
SmartWay                  0
dtype: int64

In [8]:
# view missing value count for each feature in 2018
df_18.isnull().sum()

Model                   0
Displ                   2
Cyl                     2
Trans                   0
Drive                   0
Fuel                    0
Cert Region             0
Stnd                    0
Stnd Description        0
Underhood ID            0
Veh Class               0
Air Pollution Score     0
City MPG                0
Hwy MPG                 0
Cmb MPG                 0
Greenhouse Gas Score    0
SmartWay                0
Comb CO2                0
dtype: int64

In [9]:
# drop rows with any null values in both datasets
df_08.dropna(inplace=True)
df_18.dropna(inplace=True)

In [10]:
# check whether any column in the 2008 dataset has null values; should print False
df_08.isnull().sum().any()

False

In [11]:
# check whether any column in the 2018 dataset has null values; should print False
df_18.isnull().sum().any()

False
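The null check used above can be tried on a small frame first. This is only a sketch: the toy data and the variable names are assumptions made for illustration, not values from the EPA datasets.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Cyl value (illustrative data only)
df = pd.DataFrame({'Cyl': [4.0, np.nan], 'Drive': ['2WD', '4WD']})

# Per-column null counts, then a single True/False summary
null_counts = df.isnull().sum()
has_nulls_before = bool(null_counts.any())

# After dropping rows that contain any null, the same check returns False
has_nulls_after = bool(df.dropna().isnull().sum().any())

print(has_nulls_before, has_nulls_after)
```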

In [12]:
# print number of duplicates in 2008 and 2018 datasets
print(df_08.duplicated().sum())
print(df_18.duplicated().sum())

6
0


In [13]:
# drop duplicates in both datasets
df_08.drop_duplicates(inplace=True)
df_18.drop_duplicates(inplace=True)

In [14]:
# print number of duplicates again to confirm the drop; both should be 0
print(df_08.duplicated().sum())
print(df_18.duplicated().sum())

0
0
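For reference, a minimal sketch of how `duplicated()` and `drop_duplicates()` behave, using a made-up three-row frame rather than the real data:

```python
import pandas as pd

# Toy frame where the second row repeats the first (illustrative data only)
df = pd.DataFrame({'Model': ['A', 'A', 'B'], 'Cyl': [4, 4, 6]})

# duplicated() flags every repeat of an earlier row; the first occurrence is not flagged
n_dupes = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = df.drop_duplicates()

print(n_dupes, len(deduped))
```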


Read [this](https://stackoverflow.com/questions/35376387/extract-int-from-string-in-pandas) to help you extract ints from strings in Pandas for the next step.
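As a quick illustration of the technique in that answer, here is a sketch on a toy Series shaped like the 2008 `Cyl` values. The sample strings and the name `cyl_int` are assumptions for illustration.

```python
import pandas as pd

# Toy data shaped like the 2008 Cyl column (illustrative values only)
cyl = pd.Series(['(6 cyl)', '(4 cyl)', '(12 cyl)'])

# r'(\d+)' captures the first run of digits in each string;
# expand=False returns a Series instead of a one-column DataFrame
cyl_int = cyl.str.extract(r'(\d+)', expand=False).astype(int)

print(cyl_int.tolist())  # [6, 4, 12]
```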

## Fix Cyl to Int

In [15]:
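# Extract the int from strings such as '(6 cyl)' in the 2008 Cyl column.
# Pattern from the linked answer: df['B'].str.extract(r'(\d+)').astype(int)
# The raw string avoids an invalid-escape warning; expand=False returns a Series.
df_08['Cyl'] = df_08['Cyl'].str.extract(r'(\d+)', expand=False).astype(int)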

In [16]:
# Check value counts for 2008 cyl column again to confirm the change
df_08['Cyl'].value_counts()

6     864
4     600
8     527
5     113
12     60
10     29
2       4
16      2
Name: Cyl, dtype: int64

In [18]:
# convert 2018 cyl column to int
df_18['Cyl'] = df_18['Cyl'].astype('int64')

In [19]:
df_18['Cyl'].value_counts()

4     736
6     504
8     309
3      36
12     18
5       4
16      2
Name: Cyl, dtype: int64
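The float-to-int cast depends on the earlier `dropna` step, because pandas cannot store `NaN` in a plain `int64` column. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value (illustrative data only)
cyl = pd.Series([4.0, 6.0, np.nan])

# Casting while NaN is present raises a ValueError subclass
try:
    cyl.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# Dropping the missing rows first makes the cast safe
cyl_clean = cyl.dropna().astype('int64')

print(cast_failed, cyl_clean.tolist())
```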

## Fix Air Pollution to Float

In [21]:
# check value counts for the 2008 Air Pollution Score column
print(df_08['Air Pollution Score'].value_counts())


6      1460
7       518
9.5      80
6/6      32
3        31
9        23
3/3      21
8        15
7/7      12
1         6
6/4       1
Name: Air Pollution Score, dtype: int64


In [23]:
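# Change Air Pollution Score to float.
# The pattern keeps decimals such as '9.5'; for values such as '6/4' only the first score is kept.
df_08['Air Pollution Score'] = df_08['Air Pollution Score'].str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)
df_18['Air Pollution Score'] = df_18['Air Pollution Score'].astype(float)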


In [26]:
print(df_08['Air Pollution Score'].dtypes)
print(df_18['Air Pollution Score'].dtypes)

float64
float64
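The choice of extraction pattern decides what survives for values such as `9.5` and `6/4`. The sketch below compares two patterns on three sample strings copied from the value counts above. The variable names are assumptions for illustration.

```python
import pandas as pd

# Three representative 2008 Air Pollution Score strings
scores = pd.Series(['6', '9.5', '6/4'])

# Digits only: stops at the first non-digit, so '9.5' loses its fraction
digits_only = scores.str.extract(r'(\d+)', expand=False).astype(float)

# Optional decimal part: '9.5' is kept whole; '6/4' still yields only the first score
with_decimals = scores.str.extract(r'(\d+(?:\.\d+)?)', expand=False).astype(float)

print(digits_only.tolist())    # [6.0, 9.0, 6.0]
print(with_decimals.tolist())  # [6.0, 9.5, 6.0]
```

Neither pattern keeps the second score of a value like `6/4`. Keeping it would need a separate step, such as splitting on `/`.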


## Fix MPG to Float

In [29]:
# Change the City, Hwy and Cmb MPG columns to float in both datasets,
# keeping the first integer found in each string
mpg_columns = ['City MPG', 'Hwy MPG', 'Cmb MPG']
for df in (df_08, df_18):
    for col in mpg_columns:
        df[col] = df[col].str.extract(r'(\d+)', expand=False).astype(float)

In [30]:
print(df_08.dtypes)
print(df_18.dtypes)

Model                    object
Displ                   float64
Cyl                       int64
Trans                    object
Drive                    object
Fuel                     object
Sales Area               object
Stnd                     object
Underhood ID             object
Veh Class                object
Air Pollution Score     float64
FE Calc Appr             object
City MPG                float64
Hwy MPG                 float64
Cmb MPG                 float64
Unadj Cmb MPG           float64
Greenhouse Gas Score     object
SmartWay                 object
dtype: object
Model                    object
Displ                   float64
Cyl                       int64
Trans                    object
Drive                    object
Fuel                     object
Cert Region              object
Stnd                     object
Stnd Description         object
Underhood ID             object
Veh Class                object
Air Pollution Score     float64
City MPG                fl

## Fix Greenhouse Gas to Int

In [33]:
# Extract the int from strings in the 2008 Greenhouse Gas Score column
df_08['Greenhouse Gas Score'] = df_08['Greenhouse Gas Score'].str.extract(r'(\d+)', expand=False).astype(int)

# The 2018 column is already numeric, so .str raises AttributeError; cast it directly
df_18['Greenhouse Gas Score'] = df_18['Greenhouse Gas Score'].astype(int)

In [34]:
print(df_08['Greenhouse Gas Score'].dtypes)
print(df_18['Greenhouse Gas Score'].dtypes)

int64
int64
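The `.str` accessor only works on string columns; on a numeric column pandas raises an `AttributeError`. One way to handle both cases is a small helper like the one sketched below. `to_int_column` is a hypothetical name introduced here and is not part of pandas or the original notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def to_int_column(series):
    """Hypothetical helper: return the column as int64 whether it holds numbers or strings."""
    if is_numeric_dtype(series):
        # Already numeric: a direct cast is enough
        return series.astype('int64')
    # String column: keep the first run of digits, then cast
    return series.str.extract(r'(\d+)', expand=False).astype('int64')


as_text = pd.Series(['6', '7', '4'])   # illustrative string scores
as_numbers = pd.Series([6, 7, 4])      # illustrative numeric scores

# .str on a numeric Series fails, which is why the helper checks the dtype first
try:
    as_numbers.str.extract(r'(\d+)', expand=False)
    raised = False
except AttributeError:
    raised = True

print(raised, to_int_column(as_text).tolist(), to_int_column(as_numbers).tolist())
```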


## Export CSV

In [35]:
from google.colab import files
df_08.to_csv('data_08_v3.csv', index=False)
files.download('data_08_v3.csv')


In [36]:
df_18.to_csv('data_18_v3.csv', index=False)
files.download('data_18_v3.csv')
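`files.download` is specific to Google Colab; elsewhere, `to_csv` alone writes the file. To confirm that the fixed dtypes survive a save and reload, a round trip can be sketched as follows. The sketch uses an in-memory buffer and a toy frame, both assumptions made for illustration.

```python
import io

import pandas as pd

# Toy frame with the dtypes produced in this section (illustrative data only)
df = pd.DataFrame({'Model': ['A', 'B'], 'Cyl': [4, 6], 'City MPG': [21.0, 17.0]})

# Write to an in-memory buffer instead of a file, without the index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Read it back and inspect the inferred dtypes
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```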
