# Data cleaning and preprocessing with Pandas

This Jupyter notebook presents data manipulation, cleaning, and preprocessing functions that are useful when working with datasets. It covers:

- creating a pandas DataFrame
- selecting, modifying, removing rows/columns
- descriptive statistics
- removing missing or invalid data

## Creating Pandas DataFrame
### Importing Pandas

In [156]:
import pandas as pd
import numpy as np

### Importing data from files
The ***read_csv()*** function reads CSV files; pandas can also read other formats, such as Excel files. The code snippet below reads the file *data.csv* and loads it into a pandas DataFrame.

Useful arguments of the function:
- ***sep*** specifies the character that separates the cells,
- ***encoding*** specifies the file encoding. A popular encoding is *utf-8*; if it does not work, you might try *ISO-8859-1*.

*Please note that the file must be in your working directory for the relative path below to work.*

In [157]:
dataFrame = pd.read_csv('./data.csv', sep=',', encoding='utf-8')
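
As a minimal sketch of how *sep* and *encoding* interact, the snippet below writes a small semicolon-separated file saved as ISO-8859-1 and reads it back, trying utf-8 first. The file name *colors_latin1.csv* and its contents are made up for illustration and are not part of the original dataset.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file: semicolon-separated, saved as ISO-8859-1.
# '\u00e9' is the letter e with an acute accent, which is not valid utf-8
# when stored as a single ISO-8859-1 byte.
path = os.path.join(tempfile.mkdtemp(), 'colors_latin1.csv')
with open(path, 'w', encoding='ISO-8859-1') as f:
    f.write('color_name;R;G;B\nros\u00e9;255;0;127\n')

try:
    # First attempt: the most common encoding.
    colors = pd.read_csv(path, sep=';', encoding='utf-8')
except UnicodeDecodeError:
    # Fallback when the file is not valid utf-8.
    colors = pd.read_csv(path, sep=';', encoding='ISO-8859-1')

print(colors)
```

If the wrong separator is passed, pandas usually does not fail; it loads each line into a single column, so it is worth checking the column names after loading.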

### Creating a DataFrame from a list of Python dictionaries

In [158]:
data = [{'color_name': 'black', 'R': 0, 'G': 0, 'B': 0}, {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255}, 
        {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0}, {'color_name': 'blue', 'R': 0, 'G': 0, 'B': 255},
        {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}]

dataFrame = pd.DataFrame(data)
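
The cell above passes a list of dictionaries, one per row. A DataFrame can also be built from a single dictionary that maps column names to lists of values. The sketch below uses the same colors; the names *columns_dict* and *colors* are chosen here for illustration.

```python
import pandas as pd

# One dictionary: each key is a column name, each value holds that column's data.
columns_dict = {
    'color_name': ['black', 'white', 'red', 'blue', 'green'],
    'R': [0, 255, 255, 0, 0],
    'G': [0, 255, 0, 0, 255],
    'B': [0, 255, 0, 255, 0],
}
colors = pd.DataFrame(columns_dict)

print(colors.shape)  # (5, 4): five rows, four columns
```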

### Inspecting the dataset

To inspect the dataset, use the ***head()*** function. By default it shows the first 5 rows of the dataset.

In [159]:
dataFrame.head()

| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | black |
| 1 | 255 | 255 | 255 | white |
| 2 | 0 | 0 | 255 | red |
| 3 | 255 | 0 | 0 | blue |
| 4 | 0 | 255 | 0 | green |


## Selecting, modifying and removing rows/columns
### Selecting a column
To select a column use its name:

In [160]:
dataFrame['B']

0      0
1    255
2      0
3    255
4      0
Name: B, dtype: int64

In [161]:
dataFrame['color_name']

0    black
1    white
2      red
3     blue
4    green
Name: color_name, dtype: object

### Selecting rows
To select a range of rows, use slicing as in the code below. *dataFrame[2:4]* selects all columns for rows 2 and 3 (4 is not included).

In [162]:
dataFrame[2:4]

| | B | G | R | color_name |
|---|---|---|---|---|
| 2 | 0 | 0 | 255 | red |
| 3 | 255 | 0 | 0 | blue |


If you want to select specific rows by their integer positions, use ***iloc***:

In [163]:
dataFrame.iloc[[0,2]]

| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | black |
| 2 | 0 | 0 | 255 | red |
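
*iloc* selects by integer position, while the related *loc* selects by index label. With the default numeric index the two look the same, so the sketch below uses color names as the index to show the difference. The small *colors* frame is an illustrative assumption, not the notebook's dataset.

```python
import pandas as pd

colors = pd.DataFrame(
    {'R': [0, 255, 255], 'G': [0, 255, 0], 'B': [0, 255, 0]},
    index=['black', 'white', 'red'],  # labels instead of 0, 1, 2
)

by_position = colors.iloc[[0, 2]]        # first and third row
by_label = colors.loc[['black', 'red']]  # rows labelled 'black' and 'red'

print(by_position.equals(by_label))  # True: both pick the same two rows
```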


You can also select the first rows of the dataset, e.g. the first 2, like below:

In [164]:
dataFrame[:2]

| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | black |
| 1 | 255 | 255 | 255 | white |


Or the last rows:

In [165]:
dataFrame[-2:]

| | B | G | R | color_name |
|---|---|---|---|---|
| 3 | 255 | 0 | 0 | blue |
| 4 | 0 | 255 | 0 | green |


### Selecting multiple columns
You can select multiple columns by passing a list of column names, just like below:

In [166]:
column_slice = dataFrame[['B','color_name']]
column_slice.head()

| | B | color_name |
|---|---|---|
| 0 | 0 | black |
| 1 | 255 | white |
| 2 | 0 | red |
| 3 | 255 | blue |
| 4 | 0 | green |


### Column names and indices
You can return column names by using:

In [167]:
dataFrame.columns

Index(['B', 'G', 'R', 'color_name'], dtype='object')

You can see the row numbers/names by using:

In [168]:
dataFrame.index

RangeIndex(start=0, stop=5, step=1)

### Conditional selection
You can apply conditional selection by using an expression of the form *dataFrame[column_name] condition value*, as shown below. The result of such an expression is a boolean (True/False) Series, which can be used to display the rows that fulfil the condition. For example, the code below selects all rows that have *R* higher than 100.

In [169]:
select_greater_than_100 = (dataFrame['R'] > 100)
dataFrame[select_greater_than_100]

| | B | G | R | color_name |
|---|---|---|---|---|
| 1 | 255 | 255 | 255 | white |
| 2 | 0 | 0 | 255 | red |
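
Several conditions can be combined with *&* (and), *|* (or) and *~* (not). The sketch below rebuilds the color data under the name *colors*, which is chosen for illustration, and shows one example of each operator.

```python
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red', 'blue', 'green'],
    'R': [0, 255, 255, 0, 0],
    'G': [0, 255, 0, 0, 255],
    'B': [0, 255, 0, 255, 0],
})

# Parentheses are required: & and | bind tighter than comparisons.
red_without_blue = colors[(colors['R'] > 100) & (colors['B'] == 0)]
any_red_or_green = colors[(colors['R'] > 100) | (colors['G'] > 100)]
not_red = colors[~(colors['R'] > 100)]

print(red_without_blue['color_name'].tolist())  # ['red']
```

Python's own *and*, *or* and *not* keywords do not work element-wise on a Series, which is why the symbolic operators are used here.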


### Creating new columns
You can create a new column by assigning values to a column name that does not exist yet. You can use completely new values or compute the column from existing ones, as shown below:

In [170]:
dataFrame['color_number'] = 65536 * dataFrame['R'] + 256 * dataFrame['G'] + dataFrame['B']
dataFrame.head()

| | B | G | R | color_name | color_number |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | black | 0 |
| 1 | 255 | 255 | 255 | white | 16777215 |
| 2 | 0 | 0 | 255 | red | 16711680 |
| 3 | 255 | 0 | 0 | blue | 255 |
| 4 | 0 | 255 | 0 | green | 65280 |


### Removing columns
You can remove a column by using ***del***:

In [171]:
del dataFrame['color_number']
dataFrame.head()

| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | black |
| 1 | 255 | 255 | 255 | white |
| 2 | 0 | 0 | 255 | red |
| 3 | 255 | 0 | 0 | blue |
| 4 | 0 | 255 | 0 | green |
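
*del* changes the DataFrame in place. If the original should stay intact, *drop()* is an alternative that returns a new DataFrame. A minimal sketch follows; the two-row *colors* frame is an assumption made up for illustration.

```python
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white'],
    'R': [0, 255],
    'G': [0, 255],
    'B': [0, 255],
})

# drop() returns a new DataFrame; the original keeps all of its columns.
rgb_only = colors.drop(columns=['color_name'])

print(list(rgb_only.columns))
print(list(colors.columns))
```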


Or simply pop it from the dataset. This operation removes the column from the DataFrame and returns its values as a Series:

In [172]:
new_dataFrame = dataFrame.pop('color_name')
print(new_dataFrame)
print()
print(dataFrame)

0    black
1    white
2      red
3     blue
4    green
Name: color_name, dtype: object

     B    G    R
0    0    0    0
1  255  255  255
2    0    0  255
3  255    0    0
4    0  255    0


And now we add it back to the dataset:

In [173]:
dataFrame['color_name'] = new_dataFrame
dataFrame.head()

| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | black |
| 1 | 255 | 255 | 255 | white |
| 2 | 0 | 0 | 255 | red |
| 3 | 255 | 0 | 0 | blue |
| 4 | 0 | 255 | 0 | green |


## Descriptive statistics
To inspect the numerical values of the dataset use:

In [174]:
dataFrame.describe()

| | B | G | R |
|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 |
| mean | 102.0 | 102.0 | 102.0 |
| std | 139.669252 | 139.669252 | 139.669252 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 255.0 | 255.0 | 255.0 |
| max | 255.0 | 255.0 | 255.0 |


You will see how many rows are present in the dataset, the mean, standard deviation, minimum value, maximum value, and the quartiles. If you want to see specific characteristics, there are dedicated functions for that as well.

In [175]:
dataFrame.max()

B               255
G               255
R               255
color_name    white
dtype: object

In [176]:
dataFrame['B'].min()

0

In [177]:
# numeric_only=True skips the text column (required in pandas 2.0 and later)
dataFrame.mean(numeric_only=True)

B    102.0
G    102.0
R    102.0
dtype: float64

In [178]:
dataFrame['B'].mode()

0    0
dtype: int64

etc...

You can check the correlation coefficients between the numerical columns by using the *corr()* function.

In [179]:
# numeric_only=True skips the text column (required in pandas 2.0 and later)
dataFrame.corr(numeric_only=True)

| | B | G | R |
|---|---|---|---|
| B | 1.0 | 0.166667 | 0.166667 |
| G | 0.166667 | 1.0 | 0.166667 |
| R | 0.166667 | 0.166667 | 1.0 |


## Data cleaning
You might want to check whether there are any missing values in the dataset. You can do it by running the *isnull()* function:

In [180]:
dataFrame.isnull().any()

B             False
G             False
R             False
color_name    False
dtype: bool

The *any()* function returns True for a column if any of the values in this column is True. In this case it returns False for all columns because there are no empty cells.

We can set one element of the DataFrame to null and execute the code again. The query now returns True for the *B* column because it contains one missing value.

In [181]:
dataFrame.loc[0, 'B'] = np.nan
dataFrame.head()

| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | NaN | 0 | 0 | black |
| 1 | 255.0 | 255 | 255 | white |
| 2 | 0.0 | 0 | 255 | red |
| 3 | 255.0 | 0 | 0 | blue |
| 4 | 0.0 | 255 | 0 | green |


In [182]:
dataFrame.isnull().any()

B              True
G             False
R             False
color_name    False
dtype: bool
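
*any()* only tells whether a column has missing values. To see how many there are, the boolean result of *isnull()* can be summed, since True counts as 1. A small sketch follows; the *colors* frame with hand-placed NaN values is an illustrative assumption.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({
    'R': [0.0, 255.0, np.nan],
    'G': [0.0, np.nan, np.nan],
    'B': [0, 255, 0],
})

missing_per_column = colors.isnull().sum()     # number of NaN values in each column
total_missing = int(missing_per_column.sum())  # number of NaN values overall

print(missing_per_column)
print(total_missing)
```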

We might want to remove the rows that contain empty values by running:

In [183]:
dataFrame.dropna()

| | B | G | R | color_name |
|---|---|---|---|---|
| 1 | 255.0 | 255 | 255 | white |
| 2 | 0.0 | 0 | 255 | red |
| 3 | 255.0 | 0 | 0 | blue |
| 4 | 0.0 | 255 | 0 | green |


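Now, the row corresponding to the *black* color name is missing from the result because it contained a null value. Note that *dropna()* returns a new DataFrame and leaves *dataFrame* itself unchanged unless you assign the result back.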
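
*dropna()* can also be limited to selected columns with the *subset* argument, which helps when a missing value matters only in some of them. The sketch below is a minimal example; the frame *colors* and the result names *has_b* and *complete* are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red'],
    'R': [0.0, 255.0, np.nan],
    'B': [np.nan, 255.0, 0.0],
})

# Drop a row only when 'B' is missing; NaN in other columns is tolerated.
has_b = colors.dropna(subset=['B'])

# Drop every row that has at least one missing value, and keep the result.
complete = colors.dropna()

print(has_b['color_name'].tolist())     # ['white', 'red']
print(complete['color_name'].tolist())  # ['white']
print(len(colors))                      # 3: the original is unchanged
```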

Alternatively, the missing values can be replaced using forward fill, backward fill, or the column mean. Examples of these are presented below.

### Replacing missing values with the mean

In [184]:
# Reloading the data and setting the 'R' value in the row with index 2 to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the mean of the column (mean() skips NaN)
dataFrame['R'] = dataFrame['R'].fillna(dataFrame['R'].mean())
dataFrame.head()

     B    G      R color_name
0    0    0    0.0      black
1  255  255  255.0      white
2    0    0    NaN        red
3  255    0    0.0       blue
4    0  255    0.0      green


| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | black |
| 1 | 255 | 255 | 255.0 | white |
| 2 | 0 | 0 | 63.75 | red |
| 3 | 255 | 0 | 0.0 | blue |
| 4 | 0 | 255 | 0.0 | green |
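
When several numeric columns have gaps, each can be filled with its own mean in one step, because *fillna()* accepts a Series whose index matches the column names. This is a sketch under that assumption, with a made-up *colors* frame; the text column is left untouched.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red', 'blue'],
    'R': [0.0, 255.0, np.nan, 0.0],
    'G': [0.0, np.nan, 0.0, 0.0],
})

# Mean of each numeric column, ignoring NaN (the default behaviour of mean()).
column_means = colors.mean(numeric_only=True)

# fillna() matches the index of the Series against the column names.
filled = colors.fillna(column_means)

print(filled)
```

Here the gap in *R* becomes 85.0 (the mean of 0, 255 and 0) and the gap in *G* becomes 0.0.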


### Replacing missing values with the value of the previous row (forward fill)

In [185]:
# Reloading the data and setting the 'R' value in the row with index 2 to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the previous value (forward fill);
# ffill() replaces the deprecated fillna(method='ffill')
dataFrame = dataFrame.ffill()
dataFrame.head()

     B    G      R color_name
0    0    0    0.0      black
1  255  255  255.0      white
2    0    0    NaN        red
3  255    0    0.0       blue
4    0  255    0.0      green


| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | black |
| 1 | 255 | 255 | 255.0 | white |
| 2 | 0 | 0 | 255.0 | red |
| 3 | 255 | 0 | 0.0 | blue |
| 4 | 0 | 255 | 0.0 | green |


### Replacing missing values with the value of the next row (backward fill)

In [186]:
# Reloading the data and setting the 'R' value in the row with index 2 to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the next value (backward fill);
# bfill() replaces the deprecated fillna(method='bfill')
dataFrame = dataFrame.bfill()
dataFrame.head()

     B    G      R color_name
0    0    0    0.0      black
1  255  255  255.0      white
2    0    0    NaN        red
3  255    0    0.0       blue
4    0  255    0.0      green


| | B | G | R | color_name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | black |
| 1 | 255 | 255 | 255.0 | white |
| 2 | 0 | 0 | 0.0 | red |
| 3 | 255 | 0 | 0.0 | blue |
| 4 | 0 | 255 | 0.0 | green |
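
Forward and backward fill each have an edge case: a missing value at the very start has no previous value, and one at the very end has no next value. The sketch below shows this on a short Series using the *ffill()* and *bfill()* methods; the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

values = pd.Series([np.nan, 255.0, np.nan, 0.0, np.nan])

forward = values.ffill()       # the first value has no predecessor, stays NaN
backward = values.bfill()      # the last value has no successor, stays NaN
both = values.ffill().bfill()  # forward fill first, then fill what is left

print(both.tolist())  # [255.0, 255.0, 255.0, 0.0, 0.0]
```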


### Invalid categorical values

If you want to check whether a column contains invalid entries, you can use the *value_counts()* function, as below.

In [187]:
dataFrame['color_name'].value_counts()

red      1
blue     1
green    1
white    1
black    1
Name: color_name, dtype: int64

The same applies to numerical values:

In [188]:
dataFrame['B'].value_counts()

0      3
255    2
Name: B, dtype: int64
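
Once the counts reveal a suspicious entry, the affected rows can be located with *isin()* and repaired with *replace()*. This is a sketch of one possible approach. The list *allowed* and the misspelled value *bleu* are hypothetical and do not come from the dataset above.

```python
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red', 'bleu', 'green'],  # 'bleu' is a typo
})

# Hypothetical list of accepted values.
allowed = ['black', 'white', 'red', 'blue', 'green']

# True for rows whose value is not in the accepted list.
is_invalid = ~colors['color_name'].isin(allowed)

print(colors[is_invalid])

# One possible repair: map known typos to their corrected spelling.
colors['color_name'] = colors['color_name'].replace({'bleu': 'blue'})
```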

## More functions
You can find more functions here: http://pandas.pydata.org/pandas-docs/stable/indexing.html