# Data cleaning and preprocessing with Pandas

This Jupyter notebook presents data manipulation, cleaning and preprocessing functions that are useful when working with datasets. It covers:

- creating a pandas DataFrame
- selecting, modifying, removing rows/columns
- descriptive statistics
- removing missing or invalid data

## Creating Pandas DataFrame
### Importing Pandas

In [5]:
import pandas as pd
import numpy as np

### Importing data from files
The ***read_csv()*** function reads CSV files; pandas can also read other formats, e.g. Excel files. Below is an example code snippet which reads the file *data.csv* and loads it into a pandas DataFrame.

Useful arguments of the function:
- ***sep*** specifies the character that separates the cells,
- ***encoding*** sets the file encoding. A popular encoding is *utf-8*; you might try *ISO-8859-1* if utf-8 does not work.

*Please note that in order to open a file, it must be located in your working directory (or you must pass its full path).*

In [6]:
dataFrame = pd.read_csv('./data.csv', sep=',', encoding='utf-8')

OSError: File b'./data.csv' does not exist
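The error above appears because *data.csv* is not present in the working directory. As a minimal sketch that runs without any file on disk, the example below reads CSV text from an in-memory buffer instead. The sample text and the names `csv_text` and `sample` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content that uses ';' as the separator
csv_text = "color_name;R;G;B\nblack;0;0;0\nwhite;255;255;255\n"

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text), sep=';')
print(sample)
```

With a real file, the buffer would simply be replaced by the path to the file.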

### Creating a DataFrame from a list of Python dictionaries

In [None]:
data = [{'color_name': 'black', 'R': 0, 'G': 0, 'B': 0}, {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255}, 
        {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0}, {'color_name': 'blue', 'R': 0, 'G': 0, 'B': 255},
        {'color_name': 'green', 'R': 0, 'G': 255, 'B': 0}]

dataFrame = pd.DataFrame(data)

### Inspecting the dataset

To inspect the dataset, use the ***head()*** function. By default it shows the first 5 rows of the dataset; pass a number to show a different number of rows.

In [None]:
dataFrame.head()

Please note that the output is rendered as plain text instead of a formatted table if you print it using the Python print() function.

In [None]:
print(dataFrame.head(10))
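Besides head(), a few other attributes and functions give a quick overview of a dataset. The sketch below rebuilds a small frame under the hypothetical name `colors` so that it runs on its own.

```python
import pandas as pd

colors = pd.DataFrame([
    {'color_name': 'black', 'R': 0, 'G': 0, 'B': 0},
    {'color_name': 'white', 'R': 255, 'G': 255, 'B': 255},
    {'color_name': 'red', 'R': 255, 'G': 0, 'B': 0},
])

print(colors.shape)    # (number of rows, number of columns)
print(colors.dtypes)   # data type of each column
print(colors.tail(2))  # last 2 rows, the counterpart of head()
```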

## Selecting, modifying and removing rows/columns
### Selecting a column
To select a column use its name:

In [None]:
dataFrame['B']

In [None]:
dataFrame['color_name']

### Selecting rows
To select a range of rows, slice the DataFrame. dataFrame[2:4] selects all columns for rows 2 and 3 (row 4 is not included):

In [None]:
dataFrame[2:4]

If you want to select specific rows by their integer position, use ***iloc***:

In [None]:
dataFrame.iloc[[0,2]]
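iloc selects by position, while its counterpart loc selects by row label. With the default index 0, 1, 2, ... the two look alike, so the sketch below uses a hypothetical frame `colors` with color names as row labels to show the difference.

```python
import pandas as pd

colors = pd.DataFrame(
    {'R': [0, 255, 255], 'G': [0, 255, 0], 'B': [0, 255, 0]},
    index=['black', 'white', 'red'],  # row labels instead of 0, 1, 2
)

by_position = colors.iloc[[0, 2]]        # first and third row
by_label = colors.loc[['black', 'red']]  # rows labelled 'black' and 'red'

# Both selections pick the same two rows
print(by_position.equals(by_label))
```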

You can also select the first rows of the dataset, e.g. the first 2:

In [None]:
dataFrame[:2]

Or the last rows:

In [None]:
dataFrame[-2:]

### Selecting multiple columns
You can select multiple columns by passing a list of column names, as shown below:

In [None]:
column_slice = dataFrame[['B','color_name']]
column_slice.head()

### Column names and indices
You can return column names by using:

In [None]:
dataFrame.columns

You can see the row numbers/names by using:

In [None]:
dataFrame.index

### Conditional selection
You can apply conditional selection by using **dataFrame[*column_name*] *condition* *value***, as shown below. Such an expression produces a boolean Series (one True/False value per row), which can be used to select the rows that fulfil the condition. For example, the code below selects all rows where *R* is greater than 100.

In [None]:
select_greater_than_100 = (dataFrame['R'] > 100)
dataFrame[select_greater_than_100]
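Several conditions can be combined into one selection. The following sketch assumes a frame `colors` with the same content as the dataset above; the variable names are chosen for illustration only.

```python
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red', 'blue', 'green'],
    'R': [0, 255, 255, 0, 0],
    'G': [0, 255, 0, 0, 255],
    'B': [0, 255, 0, 255, 0],
})

# Each comparison needs its own parentheses; & means "and", | means "or"
red_without_blue = colors[(colors['R'] > 100) & (colors['B'] == 0)]
green_or_blue = colors[(colors['G'] > 100) | (colors['B'] > 100)]

print(red_without_blue)
print(green_or_blue)
```

The Python keywords `and` and `or` do not work here, because they cannot compare two Series element by element.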

### Creating new columns
You can create a new column by assigning values to a column name that does not exist yet. You can use completely new values, or compute the column from the values of other columns, as shown below:

In [None]:
dataFrame['color_number'] = 65536 * dataFrame['R'] + 256 * dataFrame['G'] + dataFrame['B']
dataFrame.head()

### Removing columns
You can remove a column by using ***del***:

In [None]:
del dataFrame['color_number']
dataFrame.head()

Or simply pop it from the DataFrame. This operation removes the column and returns it as a pandas Series:

In [None]:
new_dataFrame = dataFrame.pop('color_name')
print(new_dataFrame)
print()
print(dataFrame)

The column can then be added back to the dataset:

In [None]:
dataFrame['color_name'] = new_dataFrame
dataFrame.head()
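Both del and pop() modify the DataFrame in place. If the original should stay untouched, drop() is an alternative that returns a modified copy. A minimal sketch, using a hypothetical frame `colors`:

```python
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white'],
    'R': [0, 255],
    'G': [0, 255],
    'B': [0, 255],
})

# drop() returns a copy without the listed columns
rgb_only = colors.drop(columns=['color_name'])

print(rgb_only.columns.tolist())
print(colors.columns.tolist())  # the original still has all four columns
```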

## Descriptive statistics
To inspect the numerical values of the dataset use:

In [None]:
dataFrame.describe()

You will see how many rows are present in the dataset, the mean, standard deviation, minimum value, maximum value and the quartiles of each numerical column. If you want to see a specific statistic, there are dedicated functions for that as well.

In [None]:
dataFrame.max()

In [None]:
dataFrame['B'].min()

In [None]:
# numeric_only=True skips the text column 'color_name', which has no mean
dataFrame.mean(numeric_only=True)

In [None]:
dataFrame['B'].mode()

etc...
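Several of these statistics can also be computed in a single call with agg(). The sketch below assumes a frame `colors` holding the same numerical columns as the dataset above.

```python
import pandas as pd

colors = pd.DataFrame({
    'R': [0, 255, 255, 0, 0],
    'G': [0, 255, 0, 0, 255],
    'B': [0, 255, 0, 255, 0],
})

# One row per statistic, one column per original column
summary = colors.agg(['min', 'max', 'mean'])
print(summary)
```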

You can inspect the correlation coefficients between the numerical columns by using the *corr()* function.

In [None]:
# numeric_only=True leaves out the text column 'color_name'
dataFrame.corr(numeric_only=True)

## Data cleaning
You might want to check whether there are any missing values in the dataset. You can do this with the *isnull()* function:

In [None]:
dataFrame.isnull().any()

The any() function returns True for a column if at least one value in that column is True. In this case it returns False for all columns because there are no empty cells.

We can set one element to a missing value (NaN) and run the check again. The query then returns True for the *B* column because it now contains one missing value.

In [None]:
dataFrame.loc[0, 'B'] = np.nan
dataFrame.head()

In [None]:
dataFrame.isnull().any()

To remove the rows that contain missing values, run:

In [None]:
dataFrame.dropna()

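In the returned DataFrame, the row for the *black* color name is gone because it contained a null value. Note that *dropna()* returns a new DataFrame; the original stays unchanged unless you assign the result back.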
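As a small sketch of how missing values can be counted and removed for selected columns only, the example below uses a hypothetical frame `colors` with one missing value in *B*.

```python
import numpy as np
import pandas as pd

colors = pd.DataFrame({
    'color_name': ['black', 'white', 'red'],
    'R': [0, 255, 255],
    'B': [np.nan, 255, 0],
})

# Number of missing values per column
print(colors.isnull().sum())

# Drop only the rows where 'B' is missing
cleaned = colors.dropna(subset=['B'])

# The original frame keeps all of its rows
print(len(colors), len(cleaned))
```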

Alternatively, the missing values can be replaced using the column mean, a forward fill, or a backward fill. Examples of these are presented below.

### Replacing missing values with the mean

In [None]:
# Reloading data and setting the 'R' column of the row with index 2 (the third row) to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the mean
dataFrame['R'] = dataFrame['R'].fillna(dataFrame['R'].mean())
dataFrame.head()

### Replacing missing values with the value of the previous row (forward fill)

In [None]:
# Reloading data and setting the 'R' column of the row with index 2 (the third row) to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the previous value (forward fill)
# ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
dataFrame = dataFrame.ffill()
dataFrame.head()

### Replacing missing values with the value of the next row (backward fill)

In [None]:
# Reloading data and setting the 'R' column of the row with index 2 (the third row) to NaN

dataFrame = pd.DataFrame(data)
dataFrame.loc[2, 'R'] = np.nan
print(dataFrame)

# Replacing missing values with the next value (backward fill)
# bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions
dataFrame = dataFrame.bfill()
dataFrame.head()
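To compare the three strategies side by side, the sketch below applies them to a single column. The Series `r` mirrors the *R* column with the value at index 2 missing; the name `comparison` is chosen for illustration.

```python
import numpy as np
import pandas as pd

r = pd.Series([0, 255, np.nan, 0, 0], name='R')

comparison = pd.DataFrame({
    'original': r,
    'mean': r.fillna(r.mean()),  # mean of the non-missing values
    'ffill': r.ffill(),          # value from the previous row
    'bfill': r.bfill(),          # value from the next row
})
print(comparison)
```

Which strategy fits best depends on the data: forward and backward fill assume that neighbouring rows are related, which is typical for time series but not for unordered records.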

### Invalid categorical values

To check whether a column contains invalid entries, you can use the *value_counts()* function, which lists each distinct value and how often it occurs:

In [None]:
dataFrame['color_name'].value_counts()

The same applies to numerical values:

In [None]:
dataFrame['B'].value_counts()
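Once the set of valid values is known, invalid entries can be located with isin(). The sketch below is an illustration only: the misspelled entries and the `allowed` list are made-up assumptions, not part of the dataset above.

```python
import pandas as pd

# Hypothetical column containing a typo and an inconsistent capitalisation
names = pd.Series(['black', 'white', 'whte', 'red', 'Red'], name='color_name')

# Assumed list of valid values
allowed = ['black', 'white', 'red', 'blue', 'green']

# ~ negates the boolean mask, keeping the entries that are NOT allowed
invalid = names[~names.isin(allowed)]
print(invalid)
```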

## More functions
You can find more functions in the pandas documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html