# FIT5196 Task 1 in Assessment 2
#### Student Name: Chuangfu Xie
#### Student ID: 27771539

Date: 28/04/2018

Version: 1.0

Environment: Pandas 0.22.0, Python 3.6.4 and Jupyter notebook

In [1]:
import sys
print (sys.version_info)

sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)


## 1.  Import libraries 

In [2]:
import pandas as pd
import numpy as np
import re
print("Your Pandas version: " + pd.__version__)

Your Pandas version: 0.22.0


## 2. Auditing

First, read the CSV file from the current directory with `read_csv()`:

In [3]:
df = pd.read_csv("./dataset1_with_error.csv")
# Take a peek at data 
df.head()

Unnamed: 0,Id,Title,Location,ContractType,ContractTime,Company,Category,Salary per annum,SourceName,OpenDate,CloseDate
0,12612628,Engineering Systems Analyst,Dorking,not available,permanent,Gregory Martin International,Engineering Jobs,24996,cv-library.co.uk,20121103T000000,20121203T000000
1,12612830,Stress Engineer Glasgow,Glasgow,not available,permanent,Gregory Martin International,Engineering Jobs,30000,cv-library.co.uk,20130108T150000,20130408T150000
2,12612844,Modelling and simulation analyst,Hampshire,not available,permanent,Gregory Martin International,Engineering Jobs,30000,cv-library.co.uk,20130726T150000,20130924T150000
3,12613049,Engineering Systems Analyst / Mathematical Mod...,Surrey,not available,permanent,Gregory Martin International,Engineering Jobs,27504,cv-library.co.uk,20121214T000000,20130314T000000
4,12613647,"Pioneer, Miser Engineering Systems Analyst",Surrey,not available,permanent,Gregory Martin International,Engineering Jobs,24996,cv-library.co.uk,20131025T000000,20131224T000000


Since <font color='blue'>**Id**</font> is the identifier of each record, let's check whether the column contains duplicated values:

In [4]:
sum(df['Id'].duplicated())

0
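
The same question can be asked in a few equivalent ways. The sketch below runs them on a tiny made-up frame (`toy` is hypothetical, not the assignment data) that deliberately contains one repeated Id, so each check has something to report.

```python
import pandas as pd

# Hypothetical toy data, not the assignment file: Id 2 appears twice.
toy = pd.DataFrame({"Id": [1, 2, 2, 3], "Title": ["a", "b", "c", "d"]})

# duplicated() marks every repeat after the first occurrence.
n_duplicates = int(toy["Id"].duplicated().sum())

# is_unique answers the yes/no question directly.
ids_are_unique = bool(toy["Id"].is_unique)

# keep=False flags all rows that share an Id, which helps when inspecting them.
repeated_rows = toy[toy["Id"].duplicated(keep=False)]

print(n_duplicates, ids_are_unique, len(repeated_rows))
```

In the real dataset the count is 0, so `is_unique` would be `True` and the `keep=False` filter would return no rows.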

There are no duplicated records, which is great. Next, let's get an overview of this dataframe:

In [5]:
# Check overall info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25077 entries, 0 to 25076
Data columns (total 11 columns):
Id                  25077 non-null int64
Title               25077 non-null object
Location            25077 non-null object
ContractType        25077 non-null object
ContractTime        25077 non-null object
Company             21242 non-null object
Category            25077 non-null object
Salary per annum    25077 non-null object
SourceName          25077 non-null object
OpenDate            25077 non-null object
CloseDate           25077 non-null object
dtypes: int64(1), object(10)
memory usage: 2.1+ MB


### 2.1 Check on Date data: `OpenDate` and `CloseDate`

In this section, we want to find out **whether any values are anomalous** in <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font>. These can be syntactic, semantic, or coverage anomalies. From the output above, we can see that the values stored in these two columns are both of `object` type. To check whether the dates are valid, we need to convert them from `object` type to `datetime` type.  
First, let's start with their **format**:

In [6]:
def check_date(data, col):
    '''
    Try to parse every value in a column as a datetime and
    return the positional indices of the values that fail.
    > Arguments:
    data: DataFrame. Pandas.DataFrame object
    col: str. Target column, 'OpenDate' or 'CloseDate'
    > Return:
    errors: a list of positional indices of all unparseable values
    '''
    # Initialise a list for storing the indices of failures
    errors = []
    # Check every record
    for i,each in enumerate(data[col]):
        try:
            pd.to_datetime(each)
        except ValueError:
            errors.append(i)
    return errors

error_list = check_date(df, 'OpenDate')
# Have a look at those errors
df.iloc[error_list]['OpenDate']

1102     20131803T120000
2104     20132606T000000
2839     20122003T150000
5707     20121512T150000
10881    20133004T150000
11948    20131908T000000
15353    20121406T120000
22918    20131509T000000
23007    20132901T150000
23169    20132108T120000
Name: OpenDate, dtype: object

As shown above, these **10 records** fail to convert to `datetime`, which means their format is <font color='red'>**inconsistent**</font> with the expected `YYYYMMDDThhmmss` layout: the day and month fields are in each other's position, so we need to swap them.  
With `check_date` we now have the indices of the anomalies. Let's create another function, `rectify_date`, to fix them:

In [7]:
def rectify_date(data, col):
    '''
    This function is to rectify date format error in place.
    > Arguments:
    data: DataFrame. Pandas.DataFrame object
    col: str. Target column, 'OpenDate', 'CloseDate'
    '''
    # In a faulty value the day comes before the month: YYYYDDMMThhmmss
    re_pattern = r'^(?P<year>\d{4})(?P<day>\d{2})(?P<month>\d{2})(?P<time>T\d{6})'
    error_list = check_date(data, col)
    for i in error_list:
        # Positions equal labels here because df has a default RangeIndex
        datedict = re.match(re_pattern, data.at[i, col]).groupdict()
        # Rebuild the value in YYYYMMDDThhmmss order
        rep_d = datedict['year']+datedict['month']+datedict['day']+datedict['time']
        # Rectify the value in place
        data.at[i,col] = rep_d

# Rectify data in 'OpenDate'
rectify_date(df, 'OpenDate')
# Double-check on those errors
df.iloc[error_list]['OpenDate']

1102     20130318T120000
2104     20130626T000000
2839     20120320T150000
5707     20121215T150000
10881    20130430T150000
11948    20130819T000000
15353    20120614T120000
22918    20130915T000000
23007    20130129T150000
23169    20130821T120000
Name: OpenDate, dtype: object

In [8]:
# Apply same process on 'CloseDate'
rectify_date(df,'CloseDate')
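
The check-then-swap idea used by `check_date` and `rectify_date` can also be sketched on plain strings with the standard library only. This is a minimal sketch, not the notebook's method: the names `STAMP_FORMAT` and `fix_day_month_swap` are hypothetical, and it assumes every stamp follows the 15-character `YYYYMMDDThhmmss` layout seen in the data.

```python
from datetime import datetime

# Assumed layout of a correct stamp, e.g. 20130318T120000
STAMP_FORMAT = "%Y%m%dT%H%M%S"

def fix_day_month_swap(stamp):
    """Return stamp unchanged if it parses, else retry with day and month swapped."""
    try:
        datetime.strptime(stamp, STAMP_FORMAT)
        return stamp
    except ValueError:
        # Characters 4-5 and 6-7 hold the two fields that may be swapped.
        swapped = stamp[:4] + stamp[6:8] + stamp[4:6] + stamp[8:]
        # Raises ValueError again if the swapped form is still not a valid date.
        datetime.strptime(swapped, STAMP_FORMAT)
        return swapped

print(fix_day_month_swap("20131803T120000"))
```

One limitation applies to any parse-based check of this kind: a swapped stamp whose day and month are both 12 or less still parses as a valid date, so it cannot be detected this way.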

With all dates in the correct format, we can now check whether any values in <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font> **violate the integrity constraint**: the date in <font color='blue'>**OpenDate**</font> should **precede** the one in <font color='blue'>**CloseDate**</font>.  
To find any violations, we first need to convert <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font> to `datetime` type.

In [9]:
df['OpenDate'] = pd.to_datetime(df['OpenDate'])
df['CloseDate'] = pd.to_datetime(df['CloseDate'])
# Any violation?
df[df['OpenDate']>df['CloseDate']].loc[:,['OpenDate','CloseDate']]

Unnamed: 0,OpenDate,CloseDate
6659,2013-07-03 00:00:00,2013-04-04 00:00:00
9568,2012-03-06 00:00:00,2012-02-05 00:00:00
11043,2013-11-05 00:00:00,2013-10-06 00:00:00
12473,2013-12-22 15:00:00,2013-10-23 15:00:00
19142,2012-08-03 15:00:00,2012-05-05 15:00:00
24206,2013-06-22 15:00:00,2013-03-24 15:00:00
24297,2013-01-16 00:00:00,2012-10-18 00:00:00
25039,2013-07-08 12:00:00,2013-06-08 12:00:00


We have found **8 violations**. It looks as though the <font color='blue'>**OpenDate**</font> and <font color='blue'>**CloseDate**</font> values of these records were entered in each other's column. Let's swap them back:

In [10]:
# Build the violation mask once and reuse it
violated = df['OpenDate']>df['CloseDate']
open_d = df.loc[violated, 'OpenDate']
close_d = df.loc[violated, 'CloseDate']
error_list = df[violated].index.tolist()
for i,o,c in zip(error_list, open_d, close_d):
    df.at[i,"OpenDate"]=c
    df.at[i,"CloseDate"]=o
# Double-check
sum(df['OpenDate']>df['CloseDate'])

0
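
The row-by-row swap can also be written without an explicit loop. The following is a minimal sketch on a made-up three-row frame (`toy` is hypothetical); it uses `Series.where`, which keeps a value where the condition holds and takes it from the other series where it does not.

```python
import pandas as pd

# Hypothetical toy data: row 1 has its two dates the wrong way round.
toy = pd.DataFrame({
    "OpenDate": pd.to_datetime(["2013-01-01", "2013-07-03", "2012-05-01"]),
    "CloseDate": pd.to_datetime(["2013-02-01", "2013-04-04", "2012-06-01"]),
})

# Compute the violation mask once, before anything is modified.
swapped = toy["OpenDate"] > toy["CloseDate"]

# Keep each value where the row is fine, otherwise take it from the other column.
fixed_open = toy["OpenDate"].where(~swapped, toy["CloseDate"])
fixed_close = toy["CloseDate"].where(~swapped, toy["OpenDate"])
toy["OpenDate"] = fixed_open
toy["CloseDate"] = fixed_close

# Count the violations that remain after the swap.
remaining = int((toy["OpenDate"] > toy["CloseDate"]).sum())
print(remaining)
```

Both replacement columns are built before either is assigned, so the second `where` still sees the original `OpenDate` values.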

### 2.2 Checking on numeric data: `Salary per annum`

The process above showed how to rectify anomalies that occur in date values. Now we turn to numeric data: <font color='blue'>**Salary per annum**</font>.  
Before examining the data closely, recall that column <font color='blue'>**Salary per annum**</font> is stored as `object` type. Let's define a function that checks whether each value is in numeric format by attempting the conversion `int()`:

In [11]:
def check_numeric(data, col):
    errors=[]
    for i,d in enumerate(data[col]):
        try:
            int(d)
        except ValueError:
            errors.append(i)
    return errors

alist = check_numeric(df, 'Salary per annum')
df.iloc[alist]

Unnamed: 0,Id,Title,Location,ContractType,ContractTime,Company,Category,Salary per annum,SourceName,OpenDate,CloseDate
133,46626860,Registered Home Manager Job Bournemouth,Bournemouth,full_time,not available,,Healthcare & Nursing Jobs,30K,careworx.co.uk,2013-03-29 15:00:00,2013-05-28 15:00:00
238,46627928,Care Home Manager Job North London ****K,London,full_time,not available,,Healthcare & Nursing Jobs,38K,careworx.co.uk,2012-11-07 15:00:00,2013-02-05 15:00:00
305,46628805,Home Care Workers Berkhampsted,UK,part_time,not available,,Healthcare & Nursing Jobs,14K,careworx.co.uk,2012-09-05 15:00:00,2012-11-04 15:00:00
596,46634306,Home Manager Mental Health 11 bed,Wales,not available,not available,,Healthcare & Nursing Jobs,24K,careworx.co.uk,2012-09-13 12:00:00,2012-11-12 12:00:00
647,46634923,RGN Nurse Hull Days or Nights **** per hour,UK,full_time,not available,,Healthcare & Nursing Jobs,20896.2 - 23095.8,careworx.co.uk,2013-06-05 15:00:00,2013-09-03 15:00:00
830,46637596,Staff Nurse South Shields ****,South Shields,not available,not available,,Healthcare & Nursing Jobs,23K,careworx.co.uk,2012-04-29 15:00:00,2012-05-13 15:00:00
896,48082563,Senior Chef de Partie One AA Rosette Hotel T...,Cumbria,not available,not available,Chef Results,Hospitality & Catering Jobs,16153.8 - 17854.2,caterer.com,2013-01-09 15:00:00,2013-01-23 15:00:00
952,49065458,Service Manager Learning Disabilities,Wales,not available,not available,,Healthcare & Nursing Jobs,25K,careworx.co.uk,2013-01-08 15:00:00,2013-01-22 15:00:00
980,49689021,Assessment Officer,Kent,not available,not available,,Healthcare & Nursing Jobs,23712.0 - 26208.0,careworx.co.uk,2012-02-20 00:00:00,2012-03-21 00:00:00
996,49845058,Chef de Partie Fresh Food Pub Good Reputatio...,Lancashire,not available,not available,Chef Results,Hospitality & Catering Jobs,16K,caterer.com,2012-06-01 15:00:00,2012-07-01 15:00:00


Now we can easily see that there are 3 different formats in column <font color='blue'>**Salary per annum**</font>:

```Python
'24996'                    <-- numeric
'22K'                      <-- 'K' abbreviates 1000
'24646.8 - 27241.2'        <-- given as a range
```
These irregular values need to be brought into one uniform format. For values given as a range, we substitute the mean of the range.  
Here we assume that after cleansing only integer salaries are needed, so every value processed by `extract_salary()` is converted to an integer.

In [12]:
def extract_salary(data, col):
    # Escape the dot so it only matches a literal decimal point;
    # the decimal part is optional so whole-number bounds still match
    range_pattern = r'(\d+(?:\.\d+)?) - (\d+(?:\.\d+)?)'
    k_pattern = r'(\d+)K'
    errors = check_numeric(data,col)
    for i,s in zip(errors, data.iloc[errors][col]):
        if 'K' in s:
            # '22K' -> 22000
            rep_n = 1000*int(re.match(k_pattern, s).group(1))
            data.at[i,col] = rep_n
        else:
            # '24646.8 - 27241.2' -> rounded mean of the two bounds
            low, high = re.match(range_pattern, s).groups()
            rep_n = 0.5*(float(low) + float(high))
            data.at[i,col] = int(round(rep_n))

In [13]:
extract_salary(df, 'Salary per annum')
# Convert 'Salary per annum' to numeric type
df['Salary per annum'] = pd.to_numeric(df['Salary per annum'])
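
The three salary formats can be handled by one small function that works on a single string, which makes the rules easy to test in isolation. This is a sketch under assumptions rather than the notebook's implementation: `normalise_salary`, `K_RE` and `RANGE_RE` are hypothetical names, and the patterns assume only the three formats listed above occur.

```python
import re

# '22K' style: digits followed by K
K_RE = re.compile(r"^\s*(\d+)\s*[Kk]\s*$")
# '24646.8 - 27241.2' style: two numbers separated by a dash
RANGE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*$")

def normalise_salary(text):
    """Convert '24996', '22K' or '24646.8 - 27241.2' to an integer salary."""
    text = str(text)
    k_match = K_RE.match(text)
    if k_match:
        return 1000 * int(k_match.group(1))
    range_match = RANGE_RE.match(text)
    if range_match:
        low, high = (float(g) for g in range_match.groups())
        # Use the rounded midpoint of the range.
        return int(round((low + high) / 2))
    # Anything else must already be a plain integer; int() raises otherwise.
    return int(text)

samples = ["24996", "22K", "24646.8 - 27241.2"]
print([normalise_salary(s) for s in samples])
```

Because unknown formats fall through to `int()`, a value that fits none of the three patterns raises `ValueError` instead of being silently converted.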

### 2.3 Checking on other columns

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25077 entries, 0 to 25076
Data columns (total 11 columns):
Id                  25077 non-null int64
Title               25077 non-null object
Location            25077 non-null object
ContractType        25077 non-null object
ContractTime        25077 non-null object
Company             21242 non-null object
Category            25077 non-null object
Salary per annum    25077 non-null int64
SourceName          25077 non-null object
OpenDate            25077 non-null datetime64[ns]
CloseDate           25077 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(2), object(7)
memory usage: 2.1+ MB


So far we have completed anomaly detection in columns <font color='blue'>**OpenDate**</font>, <font color='blue'>**CloseDate**</font> and <font color='blue'>**Salary per annum**</font>. Now let's have a look at the other columns:

In [15]:
df.ContractType.value_counts()

not available    19499
full_time         4883
part_time          695
Name: ContractType, dtype: int64

Column <font color='blue'>**ContractType**</font> has only 3 possible values: `not available`, `full_time` and `part_time`. 
However, the underscore "\_" looks like a typing error, so we replace it with a hyphen "-":

In [16]:
def rectify_contractType(data):
    for i,t in enumerate(data['ContractType']):
        if '_' in t:
            rep = t.replace('_','-')
            data.at[i,'ContractType'] = rep
rectify_contractType(df)
df.ContractType.value_counts()

not available    19499
full-time         4883
part-time          695
Name: ContractType, dtype: int64

Better.  
Then, let's look at column <font color='blue'>**ContractTime**</font>:

In [17]:
df.ContractTime.value_counts()

permanent        16194
not available     6212
contract          2671
Name: ContractTime, dtype: int64

Nothing unusual here. How about the data in column <font color='blue'>**Category**</font>?

In [18]:
df.Category.value_counts()

IT Jobs                             7085
Healthcare & Nursing Jobs           4334
Engineering Jobs                    3458
Accounting & Finance Jobs           3099
Sales Jobs                          2609
Hospitality & Catering Jobs         2124
Teaching Jobs                       1378
PR, Advertising & Marketing Jobs     990
Name: Category, dtype: int64
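
The eyeball checks on the categorical columns can be turned into an explicit test: fix known misspellings with a mapping, then compare what remains against a set of allowed labels. The sketch below uses a made-up series; `fixes` and `allowed` are assumptions chosen to mirror the ContractType values seen above.

```python
import pandas as pd

# Hypothetical toy column, not the assignment data.
contract_type = pd.Series(
    ["not available", "full_time", "part_time", "full_time"],
    name="ContractType",
)

# Explicit mapping from the misspelt labels to the intended ones.
fixes = {"full_time": "full-time", "part_time": "part-time"}
cleaned = contract_type.replace(fixes)

# Assumed set of valid labels; anything outside it is worth inspecting.
allowed = {"not available", "full-time", "part-time"}
unexpected = sorted(set(cleaned) - allowed)

print(cleaned.tolist(), unexpected)
```

An explicit mapping only touches the labels it names, whereas a blanket character replacement would also alter any other value that happens to contain an underscore.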

How about columns <font color='blue'>**Title**</font>, <font color='blue'>**Location**</font>, <font color='blue'>**Company**</font> and <font color='blue'>**SourceName**</font>?  
The values in these columns are naturally diverse and rarely repeat, so we keep them as they are.

## 3. Export output

Export the dataframe to `CSV` without indexing:

In [20]:
df.to_csv('./dataset1_solution.csv', index=False)

## 4. Summary

Following the instructions, I have found the following data problems:
1. Lexical errors: 'full_time' instead of 'full-time' in <font color='blue'>**ContractType**</font>.  
2. Irregularities: '24996', '22K' and '24646.8 - 27241.2' formats in <font color='blue'>**Salary per annum**</font>.
3. Violation of the integrity constraint: <font color='blue'>**CloseDate**</font> preceding <font color='blue'>**OpenDate**</font>.
4. Inconsistency: '20180428T150000' against '20182804T150000' in <font color='blue'>**OpenDate**</font>.