## Pre-processing data

In [1]:
import pandas as pd
import numpy as np
url_from = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
df_cars = pd.read_csv(url_from,header=None)

In [2]:
headers = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
           'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base',
           'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders',
           'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio',
           'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']
df_cars.columns = headers

Data pre-processing is the process of converting or mapping data from its initial “raw” form
into another format to make it ready for further analysis.
It is also often called “data cleaning” or “data wrangling”, among other terms.

In [3]:
# Step 1: remove rows with missing values (NaN) using dropna:
# axis=0 drops rows, axis=1 drops columns.
# Setting inplace=True applies the change to the DataFrame directly.
# The raw file marks missing values with '?', so convert those to NaN first.
df_cars = df_cars.replace(to_replace='?', value=np.nan)
df_cars.dropna(subset=['price'], axis=0, inplace=True)
# Now a copy for computing the mean of 'normalized-losses'
#df_nl = df_cars.dropna(subset=['normalized-losses'], axis=0)


In [4]:
# Change '?' to NaN (already done for df_cars above, so this effectively makes a working copy).
# The pandas 'replace' method swaps one value for another; filling the resulting NaNs
# with newly calculated values is done with 'fillna' further below.
df_cars_mean = df_cars.replace(to_replace='?',value=np.nan)
df_cars_mean.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


**As an example, assume that we want to replace the missing values of the variable ‘normalized-losses’
with the mean value of the variable.**

In [5]:
# convert the column from strings (object dtype) to numbers
df_series = pd.to_numeric(df_cars_mean['normalized-losses'])
df_cars_mean['normalized-losses'] = df_series
df_cars_mean.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [6]:
# 1. calculate mean
mean = df_cars_mean['normalized-losses'].mean()
mean

122.0

In [7]:
df_cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [8]:
df_cars.mean()

symboling               0.840796
wheel-base             98.797015
length                174.200995
width                  65.889055
height                 53.766667
curb-weight          2555.666667
engine-size           126.875622
compression-ratio      10.164279
city-mpg               25.179104
highway-mpg            30.686567
price                        inf
dtype: float64

In [19]:
# assign the result back instead of using inplace=True on a column selection
df_cars['normalized-losses'] = df_cars['normalized-losses'].fillna(value=mean)
df_cars.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-l/km,highway-mpg,price
0,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,11.190476,27,13495.0
1,3,122,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,11.190476,27,16500.0
2,1,122,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,12.368421,26,16500.0
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,9.791667,30,13950.0
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,13.055556,22,17450.0
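
The imputation steps above can be condensed into one short, self-contained sketch. The toy values and the names `toy` and `toy_mean` are made up for illustration and are not taken from the dataset; only the sequence of calls (`replace`, `pd.to_numeric`, `mean`, `fillna`) follows the cells above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'normalized-losses' column;
# '?' marks a missing value, as in the raw file.
toy = pd.DataFrame({'normalized-losses': ['?', '164', '?', '158', '192']})

# Step 1: turn the '?' placeholder into a real NaN.
toy = toy.replace('?', np.nan)

# Step 2: convert the column from strings to numbers.
toy['normalized-losses'] = pd.to_numeric(toy['normalized-losses'])

# Step 3: compute the mean (NaN values are skipped by default).
toy_mean = toy['normalized-losses'].mean()

# Step 4: fill the gaps and assign the result back to the column.
toy['normalized-losses'] = toy['normalized-losses'].fillna(toy_mean)

print(toy)
```

Converting to a numeric dtype before filling keeps the column numeric, so later calls such as `mean()` or `describe()` treat it as a number rather than as text.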


## Converting city-mpg from miles per gallon to liters per 100 km

First, look at the column city-mpg.

In [10]:
df_cars.loc[:,'city-mpg'].head()

0    21
1    21
2    19
3    24
4    18
Name: city-mpg, dtype: int64

In [11]:
df_cars['city-mpg'].head()

0    21
1    21
2    19
3    24
4    18
Name: city-mpg, dtype: int64

In [12]:
df_cars['city-mpg'] = 235/df_cars['city-mpg']

In [13]:
df_cars.loc[:,'city-mpg'].head()

0    11.190476
1    11.190476
2    12.368421
3     9.791667
4    13.055556
Name: city-mpg, dtype: float64
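
The constant 235 used in the conversion is a rounded value. As a rough sketch, and assuming the dataset uses US gallons (an assumption, since the notebook does not say), the factor can be derived from the unit definitions. The names `LITERS_PER_GALLON`, `KM_PER_MILE`, `factor`, and `l_per_100km` are introduced here for illustration only.

```python
import pandas as pd

# Unit definitions: 1 US gallon = 3.785411784 L, 1 mile = 1.609344 km.
LITERS_PER_GALLON = 3.785411784
KM_PER_MILE = 1.609344

# L/100 km = 100 * (liters per gallon) / (km per mile) / mpg
factor = 100 * LITERS_PER_GALLON / KM_PER_MILE  # about 235.21

# The first five city-mpg values shown above.
mpg = pd.Series([21, 21, 19, 24, 18], name='city-mpg')
l_per_100km = factor / mpg

print(round(factor, 2))
print(l_per_100km)
```

Using the unrounded factor shifts the results only in the second decimal place compared with dividing 235 by the mpg value.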

In [14]:
# rename the column to reflect the new unit (the values are liters per 100 km)
df_cars.rename(columns = {'city-mpg':'city-l/km'}, inplace=True)


In [15]:
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 26 columns):
symboling            201 non-null int64
normalized-losses    201 non-null object
make                 201 non-null object
fuel-type            201 non-null object
 aspiration          201 non-null object
num-of-doors         201 non-null object
body-style           201 non-null object
drive-wheels         201 non-null object
engine-location      201 non-null object
wheel-base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb-weight          201 non-null int64
engine-type          201 non-null object
num-of-cylinders     201 non-null object
engine-size          201 non-null int64
fuel-system          201 non-null object
bore                 201 non-null object
stroke               201 non-null object
compression-ratio    201 non-null float64
horsepower           201 non-nul

In [16]:
df_cars['price'] = df_cars['price'].astype("int")

In [17]:
df_cars['price'] = df_cars['price'].astype("float")

In [18]:
df_cars.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 26 columns):
symboling            201 non-null int64
normalized-losses    201 non-null object
make                 201 non-null object
fuel-type            201 non-null object
 aspiration          201 non-null object
num-of-doors         201 non-null object
body-style           201 non-null object
drive-wheels         201 non-null object
engine-location      201 non-null object
wheel-base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb-weight          201 non-null int64
engine-type          201 non-null object
num-of-cylinders     201 non-null object
engine-size          201 non-null int64
fuel-system          201 non-null object
bore                 201 non-null object
stroke               201 non-null object
compression-ratio    201 non-null float64
horsepower           201 non-nul

## Normalization

Normalization is an important technique to understand in data pre-processing.

## Simple feature scaling

$ x_{new} =  \frac{ x_{old} }{x_{max}} $

For non-negative data, this makes the new values range between 0 and 1.
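
A minimal sketch of this formula in pandas, using made-up values; the Series `length` and the name `length_scaled` are illustrative assumptions rather than data from the notebook.

```python
import pandas as pd

# Hypothetical values standing in for a numeric column such as 'length'.
length = pd.Series([150.0, 165.5, 172.0, 188.8, 200.0], name='length')

# Divide every value by the column maximum.
length_scaled = length / length.max()

print(length_scaled)
```

The largest value maps to exactly 1, and every other value becomes its fraction of that maximum.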

## Min-Max method

$ x_{new} =  \frac{ x_{old} - x_{min} }{x_{max} - x_{min}} $

Again, the resulting new values range between 0 and 1.
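
The same idea as a short pandas sketch, again on made-up values; `width` and `width_scaled` are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical values standing in for a numeric column such as 'width'.
width = pd.Series([60.0, 64.0, 66.0, 68.0, 72.0], name='width')

# Subtract the minimum, then divide by the range (max - min).
width_scaled = (width - width.min()) / (width.max() - width.min())

print(width_scaled)
```

Unlike simple feature scaling, the smallest value always maps to 0 and the largest to 1, whatever the original range was.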

## Z-score or standard score method

$ x_{new} =  \frac{ x_{old} - \mu }{ \sigma } $

$\mu$: average (mean)

$\sigma$: standard deviation

The resulting values hover around 0, and typically range between -3 and +3, but can be higher or lower.
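
A short pandas sketch of the z-score, on made-up values; `height` and `height_z` are hypothetical names. Note that `Series.std()` computes the sample standard deviation (`ddof=1`) by default; pass `ddof=0` if the population standard deviation is wanted instead.

```python
import pandas as pd

# Hypothetical values standing in for a numeric column such as 'height'.
height = pd.Series([48.8, 52.4, 54.3, 55.7, 59.8], name='height')

# Subtract the mean, then divide by the standard deviation.
height_z = (height - height.mean()) / height.std()

print(height_z)
```

After this transformation the column has a mean of 0 and a standard deviation of 1, up to floating-point rounding.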