# FIT5196 Task 1 in Assessment 2
#### Student Name: Chuangfu Xie
#### Student ID: 27771539

Date: 28/04/2018

Version: 1.0

Environment: Pandas 0.22.0, Python 3.6.4 and Jupyter notebook

In [None]:
import sys
print (sys.version_info)

## 1.  Import libraries 

In [None]:
import pandas as pd
import numpy as np
import re
print("Your Pandas version: " + pd.__version__)

## 2. Auditing

First, read the CSV file from the current directory with `read_csv()`:

In [None]:
df = pd.read_csv("./dataset1_with_error.csv")
# Take a peek at data 
df.head()

Since <font color='blue'>**Id**</font> is the identifier of each record, let's check whether it contains duplicates:

In [None]:
sum(df['Id'].duplicated())

There are no duplicated records. Next, let's get an overview of this dataframe:

In [None]:
# Check overall info
df.info()

### 2.1 Check on Date data: `OpenDate` and `CloseDate`

In this section, we check **whether any data contains anomalies** in <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font>. These can be syntactic, semantic, or coverage anomalies. From the output above, we can see that the values stored in these two columns are both of `object` type. To check whether the dates are valid, we need to convert them from `object` type to `datetime` type.  
First, let's start with the **format**:

In [None]:
def check_date(data, col):
    '''
    Try to parse every value in the given column as a datetime,
    then return a list of positions of the values that fail to parse.
    > Arguments:
    data: DataFrame. Pandas.DataFrame object
    col: str. Target column, 'OpenDate', 'CloseDate'
    > Return:
    errors: a list of positional indices of all errors
    '''
    # Initialise a list for storing the positions of invalid values
    errors = []
    # Check every record
    for i,each in enumerate(data[col]):
        try:
            pd.to_datetime(each)
        except ValueError:
            errors.append(i)
    return errors

error_list = check_date(df, 'OpenDate')
# Have a look at those errors
df.iloc[error_list]['OpenDate']

As shown above, these **10 records** fail to convert to `datetime` type, which means their format is <font color='red'>**inconsistent**</font> with the expected date format: the day and month appear in swapped positions, so we should swap them back.  
The `check_date` function gives us the indices of the anomalies. Let's create another function, `rectify_date`, to rectify these errors:

In [None]:
def rectify_date(data, col):
    '''
    Rectify date format errors in place by swapping the day and month fields.
    > Arguments:
    data: DataFrame. Pandas.DataFrame object
    col: str. Target column, 'OpenDate', 'CloseDate'
    '''
    re_pattern = r'^(?P<year>\d{4})(?P<month>\d{2})(?P<day>\d{2})(?P<time>T\d{6})'
    error_list = check_date(data, col)
    for i in error_list:
        datedict = re.match(re_pattern, data[col][i]).groupdict()
        # construct replaced date
        rep_d = datedict['year']+datedict['day']+datedict['month']+datedict['time']
        # rectify format in place.
        data.at[i,col] = rep_d

# Rectify data in 'OpenDate'
rectify_date(df, 'OpenDate')
# Double-check on those errors
df.iloc[error_list]['OpenDate']

In [None]:
# Apply same process on 'CloseDate'
rectify_date(df,'CloseDate')
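
As a cross-check on the regex-based fix above, the same day/month swap can be sketched with the standard library alone. The helper name `normalise_timestamp` is hypothetical, and the sketch assumes every value follows the `YYYYMMDDTHHMMSS` layout seen in this dataset. Like `rectify_date`, it cannot detect a swapped value when both orderings happen to be valid dates.

```python
from datetime import datetime

EXPECTED = "%Y%m%dT%H%M%S"   # year, month, day
SWAPPED = "%Y%d%mT%H%M%S"    # year, day, month (the anomaly seen above)

def normalise_timestamp(raw):
    """Return raw in year-month-day order, swapping day and month if needed."""
    try:
        parsed = datetime.strptime(raw, EXPECTED)
    except ValueError:
        # Not a valid date as written: retry with day and month swapped.
        # A second ValueError here propagates to the caller.
        parsed = datetime.strptime(raw, SWAPPED)
    return parsed.strftime(EXPECTED)

fixed = normalise_timestamp("20182804T150000")      # month field holds 28
unchanged = normalise_timestamp("20180428T150000")  # already valid
```

Applied to a whole column, this could look like `df['OpenDate'].map(normalise_timestamp)`, provided the column still holds strings at that point.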

With all dates in the correct format, we can now check whether the data in <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font> **violate the integrity constraint**: the date in <font color='blue'>**OpenDate**</font> should **precede** that in <font color='blue'>**CloseDate**</font>.    
To find any violations, we first need to convert <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font> into `datetime` type.

In [None]:
df['OpenDate'] = pd.to_datetime(df['OpenDate'])
df['CloseDate'] = pd.to_datetime(df['CloseDate'])
# Any violation?
df[df['OpenDate']>df['CloseDate']].loc[:,['OpenDate','CloseDate']]

We have found **8 violations**. It seems that in these records the <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font> values have been carelessly swapped. Let's swap them back:

In [None]:
open_d = df[df['OpenDate']>df['CloseDate']]['OpenDate']
close_d = df[df['OpenDate']>df['CloseDate']]['CloseDate']
error_list = df[df['OpenDate']>df['CloseDate']].index.tolist()
for i,o,c in zip(error_list, open_d, close_d):
    df.at[i,"OpenDate"]=c
    df.at[i,"CloseDate"]=o
# Double-check
sum(df['OpenDate']>df['CloseDate'])
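
The row-by-row swap above can also be written without a loop. The following is a minimal sketch on toy data (the frame `toy` and its values are made up for illustration), using `Series.where` to exchange the two columns only on violating rows.

```python
import pandas as pd

# Toy data: the second row has its dates the wrong way round.
toy = pd.DataFrame({
    "OpenDate": pd.to_datetime(["2018-01-10", "2018-03-05"]),
    "CloseDate": pd.to_datetime(["2018-02-10", "2018-02-01"]),
})

violates = toy["OpenDate"] > toy["CloseDate"]
# Series.where keeps a value where the condition is True and takes the
# replacement elsewhere, so only the violating rows are exchanged.
new_open = toy["OpenDate"].where(~violates, toy["CloseDate"])
new_close = toy["CloseDate"].where(~violates, toy["OpenDate"])
toy["OpenDate"], toy["CloseDate"] = new_open, new_close

# Number of rows still violating the constraint after the swap
remaining = int((toy["OpenDate"] > toy["CloseDate"]).sum())
```

Both replacement columns are computed before either is assigned, so the second `where` still sees the original `OpenDate` values.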

### 2.2 Checking on numeric data: `Salary per annum`

The process above has shown how to rectify anomalies occurring in date values. Now we turn to numeric data: <font color='blue'>**Salary per annum**</font>.  
Before examining these data closely, note that the column <font color='blue'>**Salary per annum**</font> is stored as `object` type. Let's define a function that checks whether each value is in numeric format by using the conversion function `int()`:

In [None]:
def check_numeric(data, col):
    errors=[]
    for i,d in enumerate(data[col]):
        try:
            int(d)
        except ValueError:
            errors.append(i)
    return errors

alist = check_numeric(df, 'Salary per annum')
df.iloc[alist]

Now we can easily see that there are 3 different formats in column <font color='blue'>**Salary per annum**</font>:

```Python
'24996'                    <-- numeric
'22K'                      <-- 'K' abbreviated for 1000
'24646.8 - 27241.2'        <-- given as a range
```
These irregularities need to be made uniform. In particular, for range data, we substitute the mean of the range.  
Here, we assume that after cleansing we only need integer values for salary. All values processed by the function `extract_salary()` are converted to integers.

In [None]:
def extract_salary(data, col):
    # Escape the dot so it matches only a decimal point; the fraction is optional
    range_pattern = r'(\d+(?:\.\d+)?) - (\d+(?:\.\d+)?)'
    k_pattern = r'(\d+)K'
    errors = check_numeric(data,col)
    for i,s in zip(errors, data.iloc[errors][col]):
        if 'K' in s:
            rep_n = 1000*int(re.match(k_pattern, s).group(1))
            data.at[i,col] = rep_n
        else:
            rep_n = 0.5*(float(re.match(range_pattern, s).group(1)) + float(re.match(range_pattern, s).group(2)))
            data.at[i,col] = int(round(rep_n))

In [None]:
extract_salary(df, 'Salary per annum')
# Convert 'Salary per annum' to numeric type
df['Salary per annum'] = pd.to_numeric(df['Salary per annum'])
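
The three formats handled above can also be covered by a single per-value function. This is a minimal standard-library sketch, not the notebook's method: the name `normalise_salary` is hypothetical, and unlike `check_numeric` it also accepts plain decimal strings by rounding them.

```python
import re

# Two numbers separated by a hyphen, each with an optional decimal part
RANGE_RE = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$')
# A whole number followed by 'K' (thousands)
K_RE = re.compile(r'^\s*(\d+)K\s*$')

def normalise_salary(raw):
    """Map one raw salary value to an integer."""
    text = str(raw)
    k_match = K_RE.match(text)
    if k_match:
        return 1000 * int(k_match.group(1))
    range_match = RANGE_RE.match(text)
    if range_match:
        low, high = (float(g) for g in range_match.groups())
        # Use the midpoint of the range, rounded to the nearest integer
        return int(round((low + high) / 2))
    # Plain numbers; raises ValueError for anything unrecognised
    return int(round(float(text)))

examples = [normalise_salary(s) for s in ('24996', '22K', '24646.8 - 27241.2')]
```

Raising `ValueError` on unrecognised input, rather than silently skipping it, makes any fourth format in the data visible immediately.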

### 2.3 Checking on other columns

In [None]:
df.info()

Previously we have done anomaly detection in columns <font color='blue'>**OpenDate**</font>, <font color='blue'>**CloseDate**</font> and <font color='blue'>**Salary per annum**</font>. Now let's have a look at the other columns:

In [None]:
df.ContractType.value_counts()

Column <font color='blue'>**ContractType**</font> has only 3 possible values: `not available`, `full_time`, `part_time`. 
However, the underscore "\_" seems to be a typing error, so we need to replace it with a dash "-":

In [None]:
def rectify_contractType(data):
    for i,t in enumerate(data['ContractType']):
        if '_' in t:
            rep = t.replace('_','-')
            data.at[i,'ContractType'] = rep
rectify_contractType(df)
df.ContractType.value_counts()
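
A loop-free alternative, sketched on toy data: `Series.replace` with an explicit mapping rewrites only the listed values and leaves everything else untouched. The sample values and the `fixes` mapping below are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy column mixing underscore and dash spellings
contract = pd.Series(["full_time", "part-time", "not available", "part_time"])

# Explicit mapping from each misspelt value to its corrected form
fixes = {"full_time": "full-time", "part_time": "part-time"}
cleaned = contract.replace(fixes)

result = cleaned.tolist()
```

An explicit mapping is stricter than a blanket character replacement: an unexpected value containing an underscore would be left as is and would still show up in `value_counts()`.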

Better.  
Then, let's look at column <font color='blue'>**ContractTime**</font>:

In [None]:
df.ContractTime.value_counts()

Nothing special. How about data in column <font color='blue'>**Category**</font>:

In [None]:
df.Category.value_counts()

How about columns <font color='blue'>**Title**</font>, <font color='blue'>**Location**</font>, <font color='blue'>**Company**</font> and <font color='blue'>**SourceName**</font>?  
The data in these columns are naturally diverse with little similarity, so we keep them as they are.

## 3. Export output

Export the dataframe to `CSV` without indexing:

In [None]:
df.to_csv('./dataset1_solution.csv', index=False)

## 4. Summary

Following the instructions, I have found the following data problems:
1. Lexical errors: 'full_time' instead of 'full-time' in  <font color='blue'>**ContractType**</font>.  
2. Irregularities: '24996','22K','24646.8 - 27241.2' in <font color='blue'>**Salary per annum**</font>.
3. Violation of the integrity constraint: <font color='blue'>**CloseDate**</font> preceding <font color='blue'>**OpenDate**</font>.
4. Inconsistency: '20182804T150000' instead of '20180428T150000' in <font color='blue'>**OpenDate**</font>.