## Parsing Raw Text Files

- __Python Version__: Python 3.6.5

In [1]:
import sys
print("Python Version: {}".format(sys.version))

Python Version: 3.6.5 (default, Jun 17 2018, 12:13:06) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]


Libraries Used:

1. [__sys__](https://docs.python.org/3.6/library/sys.html#module-sys) (Python Software Foundation, 2018): System module used to find the Python version.
2. [__os__](https://docs.python.org/3/library/os.html#module-os) (Python Software Foundation, 2018): Used for the [$listdir()$](https://docs.python.org/3.6/library/os.html#os.listdir) (Python Software Foundation, 2018) method, which lists the files present in a directory.
3. [__re__](https://docs.python.org/3.6/library/re.html#module-re) (Python Software Foundation, 2018): Python package for handling regular expressions.
4. [__json__](https://docs.python.org/3.6/library/json.html) (Python Software Foundation, 2018): Python package for loading and dumping JSON objects to and from a file.

## 1. Introduction

The task is to extract information from an unstructured text file (a text file containing multiple job listings) and convert it to 2 different data storage file formats:
1. XML (Extensible Markup Language)
2. JSON (JavaScript Object Notation)

## 2. Importing Packages

In [139]:
import os
import re
import json

## 3. Creating a sample dataset

Creating a sliced version of the job listings text file containing the first 4 job listings, to run quick tests before validating the solution against the entire dataset.

In [3]:
sample_fn = "./datasets/sample_dataset.dat"
dataset_fn = "./datasets/input.dat"

In [4]:
try:
    # Open the input first so a missing dataset does not leave an empty sample file behind.
    with open(dataset_fn) as fh, open(sample_fn, "w") as fw:
        count = 0
        for line in fh:
            fw.write(line)
            # Each listing ends with a separator line; stop after the 4th one.
            if line.strip() == "------------------------------":
                count += 1
                if count == 4:
                    break
except FileNotFoundError:
    print("File not found. Check folder location.")

Checking whether the sample file has been created.

In [5]:
def check_sample():
    dataset_dir = "./datasets"
    for file in os.listdir(dataset_dir):
        if file == "sample_dataset.dat":
            print("File created")
            return True

    print("Sample file not created.")    
    return False

In [6]:
check_sample()

File created


True

## 4. Counting the total job listings provided

Inspecting the dataset shows that the listings are separated by a line made up of `-` characters, so there are two ways of finding the total count of job postings in the dataset:
1. __using an iterative approach__: Counting the number of occurrences of the _unique_ separator string.
2. __using regular expressions__: Designing a regular expression to match the _unique_ separator string and taking the length of the resulting list.

__1. Iterative approach__

In [7]:
def count_iterative():
    job_count = 0
    with open(dataset_fn) as fh:
        for line in fh:
            if line.strip() == "------------------------------":
                job_count += 1
    return job_count

__2. Using regular expressions__

In [8]:
def count_re():
    expr = r"---+"
    # The context manager closes the file even if reading fails.
    with open(dataset_fn) as fh:
        textfile = fh.read()
    return len(re.findall(expr, textfile))

__Regular Expression used:__ `---+`
<br/>
__Explanation__: Matching 3 or more `-` symbols, which uniquely identifies the separator for each job listing.

In [9]:
job_count = count_iterative()
print("Total Job Postings: {}".format(job_count))

Total Job Postings: 32474


In [10]:
job_count = count_re()
print("Total Job Postings: {}".format(job_count))

Total Job Postings: 32474


Thus by the two approaches it can be verified that there are `32474` job postings in the given dataset.
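
As a quick sanity check, the two counting approaches can also be compared on a small in-memory string instead of the dataset file. The sketch below is illustrative only: `sample_text` and its two listings are hypothetical, and the 30-hyphen separator is assumed from the inspection of the dataset described above.

```python
import re

# Separator line assumed from the dataset: 30 consecutive hyphens.
SEPARATOR = "-" * 30

# Hypothetical miniature dataset with two listings (not taken from the real file).
sample_text = (
    "ID: 11111\nJOB_T: Analyst\n" + SEPARATOR + "\n"
    "ID: 22222\nJOB_T: Engineer\n" + SEPARATOR
)

# Iterative approach: count lines that consist only of the separator.
iterative_count = sum(
    1 for line in sample_text.splitlines() if line.strip() == SEPARATOR
)

# Regular-expression approach: count runs of three or more hyphens.
regex_count = len(re.findall(r"---+", sample_text))

print(iterative_count, regex_count)
```

Note that `---+` would also match any run of three or more hyphens inside a listing body, so the agreement of the two counts on the real dataset is what suggests the separator is unambiguous there.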

## 5. Split dataset method

In [11]:
def find_individual_listing(re_val, dataset_file):
    try:
        with open(dataset_file) as fh:
            textfile = fh.read()
    except FileNotFoundError:
        print("File not found. Check if <datasets> folder exists")
        # Without the file there is nothing to split, so return an empty list.
        return []

    return re.findall(re_val, textfile)

__Method: $find\_individual\_listing()$__

- Parameters: 
    - re_val[str]: Regular expression used for splitting the dataset.
    - dataset_file[str]: Relative file path for the dataset being used i.e.
        - sample_fn: Sample Dataset file path.
        - dataset_fn: Actual Dataset file path.
        
- Return:
    - job_listings[list]: List containing individual job postings.
    
- Returns a list of individual job listings from the provided dataset, using the [$findall()$](https://docs.python.org/3.6/library/re.html#re.findall) (Python Software Foundation, 2018) routine of the `re` module, which scans the passed string and returns a list with the captured group of every non-overlapping match.

In [12]:
re_val_1 = r"(ID(?:.*\n)*)"

__Regular Expression used:__ `(ID(?:.*\n)*)` <br/>
__Explanation:__ 

- Match the characters `I` and `D`, which start every listing.
- Match every line thereafter using the expression `(?:.*\n)*`, i.e. any characters up to and including a newline, repeated any number of times.
- The `?:` within `()` denotes that it is a non-capturing group.

In [13]:
sample_job_lists = find_individual_listing(re_val_1, sample_fn)
print("Total listings captured: {}\n".format(len(sample_job_lists)))

Total listings captured: 1



The __re__ used did not split the dataset into individual listings. Instead it matched the entire text as one group, because the expression `(?:.*\n)*` is greedy and keeps consuming lines for as long as it can.

__Solution:__ There must be some stopping point indicating the start of the new split. <br/>
As observed previously, each job posting ends with the sequence `------------------------------`, so this property can be used to find the end of each listing.

In [14]:
re_val_2 = r"(ID(?:.*\n)*(?:---+\n?))"

__Regular Expression used:__ `(ID(?:.*\n)*(?:---+\n?))` <br/>
__Explanation:__ 

- Match strings starting with the characters `I` and `D`.
- Match every line thereafter using the expression `(?:.*\n)*`.
- Use an additional sequence, `---+\n?`, to match the `-`'s present after each job listing.
- The `\n?` makes the newline optional, which covers the last listing in the dataset, since it has no newline character after its separator.

In [15]:
sample_job_lists = find_individual_listing(re_val_2, sample_fn)
print("Total listings captured: {}\n".format(len(sample_job_lists)))

Total listings captured: 1



The above __re__ does not produce the desired output either.

__Reason:__ Regular expressions are greedy by default, so each repetition captures as many characters as possible. Thus the expression `(?:.*\n)*` matches all lines up to the last sequence of `------------------------------`.

__Solution:__ Use lazy matching by appending a `?` after the repetition symbol `*`, so that the regular expression captures only as much as is necessary.

In [16]:
re_val_3 = r"(ID(?:.*\n)*?(?:---+))"

__Regular Expression used:__ `(ID(?:.*\n)*?(?:---+))` <br/>
__Explanation:__ 

- Appending a `?` after the `*` makes the repetition lazy, so the match stops at the first separator it reaches instead of the last one.
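
The difference between the greedy and the lazy repetition can be reproduced on a small in-memory string. The following is a minimal sketch under stated assumptions: the two listings in `text` are hypothetical, and only the two patterns are taken from the cells above.

```python
import re

SEPARATOR = "-" * 30  # separator assumed from the dataset

# Hypothetical two-listing text; the last separator has no trailing newline.
text = (
    "ID: 11111\nJOB_T: Analyst\n" + SEPARATOR + "\n"
    "ID: 22222\nJOB_T: Engineer\n" + SEPARATOR
)

greedy_re = r"(ID(?:.*\n)*(?:---+\n?))"  # same pattern as re_val_2
lazy_re = r"(ID(?:.*\n)*?(?:---+))"      # same pattern as re_val_3

greedy_matches = re.findall(greedy_re, text)
lazy_matches = re.findall(lazy_re, text)

print(len(greedy_matches))  # 1: the greedy repetition runs to the last separator
print(len(lazy_matches))    # 2: the lazy repetition stops at each separator
```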

In [17]:
sample_job_lists = find_individual_listing(re_val_3, sample_fn)
print("Total listings captured: {}\n".format(len(sample_job_lists)))

Total listings captured: 4



The regular expression works well with the sample dataset. The next step is to validate the same regular expression against the actual dataset.

__Expected length of the list:__ `32474` elements

In [18]:
all_job_lists = find_individual_listing(re_val_3, dataset_fn)
print("Total listings captured: {}\n".format(len(all_job_lists)))

Total listings captured: 32474



Since the number of job listings has been validated, the next step is to extract the features from the dataset.

## 6. Finding individual features:

Observing each individual job listing the following features can be extracted from the dataset:

- __Job ID__
- __Job Procedures__
- __Job Title__
- __Job location__
- __Application Deadline__
- __Start Date__
- __Job Responsibilities__
- __Job Description__
- __Job Salary__
- __About Company__
- __Required Qualifications__

Devising an approach to find a list of all possible feature names using the regular expression:

`(.*?):` which extracts all values before `:`. This step is bound to return garbage values, but all possible feature names would also be covered.

In [138]:
feature_name_re = r"(.*?):"

possible_features = []
for listing in all_job_lists:
    feature_names = re.findall(feature_name_re, listing)
    for val in feature_names:
        if val not in possible_features:
            possible_features.append(val.strip())

In [20]:
print(len(possible_features))

10309


Viewing a slice of the feature names:

In [21]:
possible_features[0:10]

['ID',
 'JOB_PROC',
 ' Please submit your resumes to',
 'http',
 'DEAD_LINE',
 ' 14 October 2014, 18',
 'JOB_T',
 'DATE_START',
 'JOB_LOC',
 'REQUIRED QUALIFICATIONS']

Introducing a negative lookahead in the regular expression to exclude phrases whose `:` is immediately followed by:

- a number [To remove dates and times]
- a forward slash: `/` [To remove web links]
- period symbol `.`

In addition, instead of matching all characters, the character set is restricted to letters and `_`, and the match is limited to either `1` or `2` words before the occurrence of `:`.

The modified regular expression would be:
`(?:([a-zA-Z_]+(?: [a-zA-Z_]+?)?):)(?![0-9/.])`

In [22]:
feature_name_re = r"(?:([a-zA-Z_]+(?: [a-zA-Z_]+?)?):)(?![0-9/.])"
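
To see what the tightened expression keeps and drops, it can be run on a few lines shaped like the listing header. This is a sketch only: the `listing` text is hypothetical and `http://example.invalid/apply` is a placeholder URL, not a value from the dataset.

```python
import re

feature_name_re = r"(?:([a-zA-Z_]+(?: [a-zA-Z_]+?)?):)(?![0-9/.])"

# Hypothetical listing fragment with a URL, a time and a two-word feature name.
listing = (
    "ID: 29711\n"
    "JOB_PROC: Please submit your resumes to:http://example.invalid/apply\n"
    "DEAD_LINE: 14 October 2014, 18:00\n"
    "REQUIRED QUALIFICATIONS:\n"
)

names = re.findall(feature_name_re, listing)
print(names)
# ['ID', 'JOB_PROC', 'resumes to', 'DEAD_LINE', 'REQUIRED QUALIFICATIONS']
```

The `http:` prefix and the time `18:00` are no longer captured, but a two-word phrase such as `resumes to` still slips through, which is consistent with the remaining garbage values in the feature list.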

In [23]:
stopwords = []
with open("stopwords_en.txt") as fh:
    for line in fh:
        val = line.strip()
        if val != "about":
            stopwords.append(val)

In [24]:
possible_features = set()
for listing in all_job_lists:
    feature_names = re.findall(feature_name_re, listing)
    for val in feature_names:
        possible_features.add(val)

In [25]:
print(len(possible_features))

3409


This list of possible features gives a condensed view of the candidate feature names that need to be validated for a particular feature in the regular expression.

### 6.1 Job ID:

In [26]:
def find_feature_names(feature_names, possible_features=()):
    names = []
    for name in possible_features:
        for feature_name in feature_names:
            # Case-insensitive substring check, keeping each name only once.
            if feature_name in name.lower() and name not in names:
                names.append(name)

    for val in names:
        print(val)

__Method: $find\_feature\_names()$__

- Parameters: 
    - feature_names[list]: Lower-case substrings to look for in the feature names.
    - possible_features[set]: Collection of all possible feature names.
    
- Prints the possible feature names that contain any of the given $feature\_names$

In [27]:
find_feature_names(["id"], possible_features)

Provide interpretation
confidentiality
bid packages
candidate must
considerations
provided below
Fiduciary Controls
considered if
Lordkipanidze at
Friday
and provide
side applications
ideas
and outside
s Guidelines
provided through
national identity
style guide
their bid
consideration to
candidate to
Incident Handling
candidates should
the candidate
Candidates must
Bajelidze at
Ordinary Resident
Video materials
a guideline
be elucidated
Resident Representative
be considered
must provide
providing services
considered
candidate should
Guidelines
provided services
video coverage
and confidentiality
environmental guidelines
side technologies
Bank Guidelines
candidate will
Ordinarily Resident
ID
BDS providers
providers through
Banks Guidelines
Video Production
Groups guidelines


From this list it is evident that there are no other variations for the feature name other than `ID`.

Looking at one single job listing in isolation from the others.

In [28]:
print(sample_job_lists[0])

ID: 29711
JOB_PROC: Please submit your resumes to:http://tbe.taleo.net/NA6/ats/careers/requisition.jsp?org=QUESTRADE&cws=1&rid=223
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 14 October 2014, 18:00
JOB_T: Head of Corporate Customers Relationship Management Division
OPEN TO/
DATE_START: 12 July 2005
JOB_LOC: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - Higher education;
- At least 2 years of work experience in programming.
REMUNERATION/
ABOUT COMPANY:
 According to Brown-Wilson Group Survey* EPAM Systems is
the #1 software engineering outsourcing services provider in Central and
Eastern Europe. Founded in 1993, EPAM maintains North American
headquarters in Lawrenceville, NJ. 
Currently there are 3500+ highly qualified IT professionals working at
EPAM Systems. EPAM software development centers are located in Russia,
Hungary, Belarus, Ukraine an

__Observations:__
1. Fairly distinctive patterns identify the majority of __features__ and their corresponding __values__.
2. However, there are some questionable markers such as:
    - `JOB_T`: Seems like an incomplete feature name.
    - `OPEN TO/` and `REMUNERATION/`: Most feature names are separated from their values by a `:`, but in this case the separator seems to be `/`, which should be noted.
    - `tasks:` (under JOB DESCRIPTION): At first glance `:` seemed to be the best parameter to separate features from values, barring a few cases. However, some phrases end with a `:` even though they are not feature names.
3. Multiline values for features like `About Company`, which may be difficult to capture.

In [29]:
id_re = r"ID: (.*)"

__Regular Expression used:__ `ID: (.*)` <br/>
__Explanation:__ 

- Match the literal text `ID: `.
- Capture all characters thereafter on the same line using the expression `(.*)`, which is the capturing group.

Performing initial tests on the sample dataset.

In [30]:
job_ids = [
    re.findall(id_re, val)[0]
    for val in sample_job_lists
]

In [31]:
len(job_ids)

4

There seem to be no issues in capturing values from the sample dataset, so the regular expression is now applied to the actual dataset.

In [32]:
job_ids = [
    re.findall(id_re, val)[0]
    for val in all_job_lists
]

In [33]:
print("Total job ids fetched are: {}".format(len(job_ids)))

Total job ids fetched are: 32474


All job id values have been fetched from the dataset. The next check looks for empty values in the list.

In [34]:
job_id_bool = [len(val) > 0 for val in job_ids]

In [35]:
all(job_id_bool)

True

There are no empty values obtained from the dataset.
<br/>
A check can be done to see if all ID values in the list are of length `5`.

In [36]:
job_id_bool = [len(val) == 5 for val in job_ids]

In [37]:
all(job_id_bool)

True

The [$all()$](https://docs.python.org/3.6/library/functions.html#all) (Python Software Foundation, 2018) routine returns __True__ if all the values in the list are __True__. If any value turns out to be __False__, then __False__ is returned.

Thus, using the $all()$ routine, it can be concluded that all job ids within the dataset are of length `5`.
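
A length check alone does not guarantee that the values are numeric, which matters before converting them with `int()`. The sketch below combines both checks on two hypothetical listings; the `listings` data and the extra `isdigit()` test are illustrative assumptions rather than part of the original checks.

```python
import re

id_re = r"ID: (.*)"

# Hypothetical listing fragments (only the ID line matters here).
listings = [
    "ID: 29711\nJOB_T: Analyst\n",
    "ID: 60237\nLOCATIONS: Yerevan, Armenia\n",
]

job_ids = [re.findall(id_re, listing)[0] for listing in listings]

# True only if every ID has exactly five characters and all of them are digits.
all_five_digits = all(len(val) == 5 and val.isdigit() for val in job_ids)

# Safe to convert once the check above has passed.
job_id_int = [int(val) for val in job_ids]

print(all_five_digits, job_id_int)
```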

Converting the job ids from `str` into `int`

In [38]:
job_id_int = [
    int(val)
    for val in job_ids
]

In [39]:
print("Total job ids captured: {}".format(len(job_id_int)))

Total job ids captured: 32474


### 6.2 Job Procedure:

Getting a list of possible feature names for `Job Procedure`

In [40]:
find_feature_names(["proc", "procedure"], possible_features)

to process
JOB_PROCS
Related processes
PROCEDURE
sales process
Procurement
procurement
company procedures
transformation processes
and procedures
process discipline
Procurement Division
recruitment processes
procurements of
Process management
and Procedures
repair processes
JOB_PROC
Process Expertise
the process
bussing procedures
processes including
planning process
Procurement Activities
following processes
performed processes
PROCEDURES
standard process
procedure
budget process
procurement procedures
Procedures govern
administrative procurement
Transformation Processes
Assurance processes
procedures
process
Agile processes
collection process
HR procedures
Procedures
Improvement Process


Possible feature names could be:
1. JOB_PROC
2. PROCEDURE
3. PROCEDURES
4. JOB_PROCS
5. procedures
6. procedure
7. Procedures

__Analysis:__

1. Multiline feature
2. Followed by Application Deadline in the majority of cases.

Based on these observations, a regular expression for Job Procedures would be:

In [41]:
job_procedure_re = r"(?:(?:JOB_PROCS?)|(?:PROCEDURES?))((?:.*\n)*?)(?:(?:DEAD_LINE)|(?:APPLICATION_DL)|(?:APPLICATION_DEADL))"

In [42]:
job_procedures = [
    re.findall(job_procedure_re, val)[0]
    if len(re.findall(job_procedure_re, val)) > 0
    else
        None
    for val in sample_job_lists
]

In [43]:
print(len(job_procedures))

4


The regular expression has extracted the `Job procedures` from the sample dataset. A single entry is displayed below to validate the result.

In [44]:
print(job_procedures[0])

: Please submit your resumes to:http://tbe.taleo.net/NA6/ats/careers/requisition.jsp?org=QUESTRADE&cws=1&rid=223
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.



There is an error in the above text:

- `:` has been included as a part of the procedure, which must be removed.

Two further issues were found:

1. The regular expression is also position dependent (i.e. it assumes that the value of Job procedure always appears directly above the application deadline). 

A manual lookup of the dataset reveals a unique pattern: each Job procedure, if present, ends with the lines: 
```
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
```

These lines are repeated in every procedure, so they can be used as a stopping point for each job procedure.

2. The lookup also showed that the identifier for Job Procedures is not always `JOB_PROCS` or `PROCEDURES`; in some cases it is written in lower case characters. To cover these cases and to create a generalized regular expression, the __re__ only looks for `PROC` in both upper and lower case, since it is the common term.

Thus modifying the regular expression based on the above arguments.

In [45]:
job_procedure_re = r"(?:(?:(?:PROC)|(?:proc)).*?: )((?:.*\n)*?.*Thanks\.)"

In [46]:
job_procedures = [
    re.findall(job_procedure_re, val)[0]
    if len(re.findall(job_procedure_re, val)) > 0
    else
        None
    for val in sample_job_lists
]

In [47]:
print(job_procedures[0])

Please submit your resumes to:http://tbe.taleo.net/NA6/ats/careers/requisition.jsp?org=QUESTRADE&cws=1&rid=223
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.


The result appears to be correct. However, instead of using the `|` (or) separator, a character set `[]` can be used for each letter of `PROC`, which matches the string in any mix of upper and lower case (e.g. `Procedures`).

In [48]:
job_procedure_re = r"(?:(?:[pP][rR][oO][cC]).*?: )((?:.*\n)*?.*Thanks\.)"
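
The two forms are not fully equivalent, which the following sketch illustrates on a hypothetical listing whose feature name is written as `Procedures`. Only the two patterns come from the cells above; the `listing` text is made up for illustration.

```python
import re

alternation_re = r"(?:(?:(?:PROC)|(?:proc)).*?: )((?:.*\n)*?.*Thanks\.)"
char_class_re = r"(?:(?:[pP][rR][oO][cC]).*?: )((?:.*\n)*?.*Thanks\.)"

# Hypothetical listing using the mixed-case feature name `Procedures`.
listing = "Procedures: Email your CV.\nwebsite - www.careercenter.am, Thanks.\n"

# The alternation only knows `PROC` and `proc`, so `Proc` is missed.
alternation_match = re.search(alternation_re, listing)

# The character sets accept each letter in either case.
char_class_match = re.search(char_class_re, listing)

print(alternation_match)
print(char_class_match.group(1))
```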

Using the above regular expression on the actual dataset.

The code snippet below uses an `if-else` construct within a list comprehension to include every captured job procedure and to use `None` for missing values, so that the indexing is maintained.

In [49]:
job_procedures = [
    re.findall(job_procedure_re, val)[0]
    if len(re.findall(job_procedure_re, val)) > 0
    else
        None
    for val in all_job_lists
]
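
The comprehension above runs `re.findall()` twice for every listing. One possible way to avoid that is a small helper built on `re.search()`; the sketch below assumes a hypothetical helper name `first_match` and hypothetical sample listings, and it relies on the pattern having exactly one capturing group.

```python
import re


def first_match(pattern, text):
    """Return the first captured group of `pattern` in `text`, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


job_procedure_re = r"(?:(?:[pP][rR][oO][cC]).*?: )((?:.*\n)*?.*Thanks\.)"

# Hypothetical listings: the first has a procedure, the second does not.
listings = [
    "ID: 11111\n"
    "procedures: Send your CV to: hr@example.invalid\n"
    "website - www.careercenter.am, Thanks.\n"
    "DEAD_LINE: 1 May 2010\n",
    "ID: 22222\nDEADLINES: 20 January 2006\n",
]

# One regex pass per listing; missing values stay as None to keep the indexing.
job_procedures = [first_match(job_procedure_re, val) for val in listings]
print(job_procedures)
```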

In [50]:
print(len(job_procedures))

32474


Checking which indices are `None`, so that a manual verification can be done.

In [51]:
missing_idx = [index for index, val in enumerate(job_procedures) if val is None]
print(missing_idx[2:5])
print(missing_idx[1000:1005])
print(missing_idx[-3:])

[34, 47, 62]
[12999, 13005, 13020, 13023, 13078]
[32432, 32441, 32448]


In [52]:
print(all_job_lists[32448])

ID: 60237
DEADLINES: 20 January 2006
start_date: 09 July 2015
LOCATIONS: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - University degree; 
- At least 2 senior level experience in SME Lending in internationally
co-funded programs (EBRD, KFW, GAF, etc.); 
- Excellent knowledge of relevant legal and regulatory aspects;
- Excellent knowledge of national accounting standards;
- Strong analytical and problem solving skills; 
- Strong interpersonal skills;
- Organizational skills and great team player; 
- Ability to work under pressure; 
- Readiness for extensive countrywide travel;
- Excellent knowledge of Russian and/ or English languages (oral and
written).
COMPANYS_INFO:
 UniCAD is a software start-up company specialized in the
development of Electronic Design Automation (EDA) CAD tools, which is
located in Yerevan, Armenia. UniCAD is a fully owned subsidiary of
E-Z-CAD that is situated in the heart of Silicon Valley in Mountain
View, CA, USA.
SALARY: According to the ""S"" grade of the ba

### 6.3 Job Title:

__Not performing checks on the sample dataset from here on, since the execution speed for the actual dataset is manageable.__

Getting a list of possible feature names for `Job Titles`

In [53]:
find_feature_names(["tl", "_t"], possible_features)

Supervisors Title
TITLES
title
report outlining
JOB TITLE
Resettlement
JOB_T
cattle breeding
title to
of outlets
Resettlement projects
strictly to
_TTL
directly to


From the above list, the possible variations for `Job Title` are:
1. JOB_T
2. \_TTL
3. title
4. TITLES
5. JOB TITLE

__Analysis:__

1. Single Line Feature. Therefore not many checks need to be made.

Based on these observations, a regular expression for Job Title would be:

In [54]:
job_title_re = r"(?:(?:(?:_T.*)|(?:(?:JOB )?[tT][iI][tT][lL][eE][S]?)): )(.*)"

__Regular Expression used:__ `(?:(?:(?:_T.*)|(?:(?:JOB )?[tT][iI][tT][lL][eE][S]?)): )(.*)` <br/>
__Explanation:__ 

- Match a feature name that either contains `_T` (covering `JOB_T` and `_TTL`) or is `TITLE` in either alphabet case, optionally preceded by `JOB ` and optionally followed by `S`.
- Since `JOB ` and the last `S` are optional the __re__ also matches the name `title`.
- Capture all characters after `: ` on the same line using the expression `(.*)`, which gives the `Job Title` for the listing.
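
The behaviour described above can be checked against one line per feature-name variation. This is a minimal sketch: the job titles in `lines` are hypothetical, and the last case highlights a caveat of the greedy `_T.*` branch rather than something observed in the dataset.

```python
import re

job_title_re = r"(?:(?:(?:_T.*)|(?:(?:JOB )?[tT][iI][tT][lL][eE][S]?)): )(.*)"

# Hypothetical single-line examples, one per feature-name variation.
lines = [
    "JOB_T: Radio Engineer",
    "_TTL: Accountant",
    "title: Chef",
    "TITLES: Sales Manager",
    "JOB TITLE: Driver",
]
titles = [re.findall(job_title_re, line)[0] for line in lines]
print(titles)

# Caveat: `_T.*` is greedy, so a title that itself contains ": " is cut
# at the last ": " on the line.
truncated = re.findall(job_title_re, "JOB_T: Lead: Sales")[0]
print(truncated)
```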

In [55]:
job_titles = [
    re.findall(job_title_re, val)[0]
    if len(re.findall(job_title_re, val)) > 0
    else
        None
    for val in all_job_lists
]

In [56]:
len(job_titles)

32474

Finding the indices for job titles which have value `None`, and performing manual validation.

In [57]:
missing_idx = [index for index, val in enumerate(job_titles) if val is None]
print(missing_idx[2:5])
print(missing_idx[1000:1005])
print(missing_idx[-3:])

[46, 72, 76]
[14002, 14022, 14031, 14038, 14051]
[32428, 32448, 32456]


Total empty values in all listings.

In [58]:
len([val for val in job_titles if val is None])

2282

In [59]:
print(all_job_lists[32456])

ID: 40386
JOB_PROCS: All qualified and interested candidates should
submit their CVs/ resumes to: jobs.armenia@... . Please mention
""JobID 0214"" in the subject line of the email. Only shortlisted
candidates will be notified for the interview.
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 25 April 2007
START DATE: 12 April 2011
LOCATION: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - University degree in Finance, Economics or Accounting;
- Minimum 3 years of work experience in accounting; 
- Good knowledge of RA Legislation on Taxation and accounting
standards;
- Knowledge of accounting software;
- Fluency in English language is a plus;
- Awareness of details, accuracy and reliability;
- Good analytical and organizational skills;
- Excellent communication skills.
ABOUT:
 ""Youth For Achievements"" is an educational NGO created
with the mission t

### 6.4 Job Location:

Getting a list of possible feature names for `Job Location`

In [60]:
find_feature_names(["loc"], possible_features)

local fundraising
Local Levels
Job location
LOCATIONS
locations
local legislation
LOCATION
JOB_LOC
_LOCS
_LOC
located at
local requirements
Local Partners
local levels
local website


From the above list, the possible variations for `Job Location` are:
1. JOB_LOC
2. LOCATION
3. \_LOC
4. LOCATIONS
5. \_LOCS
6. Job location
7. locations

__Analysis:__

1. Single Line Feature. Therefore not many checks need to be made.
2. Every variation of the feature name includes the string `loc`, which can be used as a part of the __re__.

Based on these observations, a regular expression for Job Locations would be:

In [61]:
location_re = r"(?:(?:[lL][oO][cC]).*: )(.*)"

__Regular Expression used:__ `(?:(?:[lL][oO][cC]).*: )(.*)` <br/>
__Explanation:__ 

- Match strings containing `LOC` in both alphabetical cases as a part of the string, followed by all characters up to the last occurrence of `: ` on the same line.
- Matching all characters thereafter within the same line using the expression `(.*)`, to capture the `Job Locations` for the listing.
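
Because the expression accepts `loc` anywhere on a line, prose that happens to contain it can be captured before the real feature. The sketch below contrasts the pattern above with a hypothetical stricter variant (`strict_location_re`, an assumption for illustration and not the approach used here) that anchors the feature names listed earlier to the start of a line; the `listing` text is also made up.

```python
import re

location_re = r"(?:(?:[lL][oO][cC]).*: )(.*)"  # pattern from the cell above

# Hypothetical alternative: only accept the known feature names at line start.
strict_location_re = r"^(?:JOB_LOC|_LOCS?|LOCATIONS?|Job location|locations): (.*)$"

# Hypothetical listing whose procedure text contains the word "located".
listing = (
    "ID: 11111\n"
    "JOB_PROC: Our office is located at: 5 Main Street\n"
    "JOB_LOC: Yerevan, Armenia\n"
)

loose = re.findall(location_re, listing)[0]
strict = re.search(strict_location_re, listing, flags=re.MULTILINE).group(1)

print(loose)   # text after "located at: " in the procedure line
print(strict)  # value of the JOB_LOC line
```

The stricter variant trades generality for precision: it would miss any feature-name spelling that is not listed explicitly.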

In [62]:
job_locations = [
    re.findall(location_re, val)[0]
    if len(re.findall(location_re, val)) > 0
    else
        None
    for val in all_job_lists
]

In [63]:
len(job_locations)

32474

Finding the indices for job locations which have value `None`, and performing manual validation.

In [64]:
missing_idx = [index for index, val in enumerate(job_locations) if val is None]
print(missing_idx[2:5])
print(missing_idx[1000:1005])
print(missing_idx[-3:])

[40, 48, 54]
[14584, 14598, 14601, 14607, 14611]
[32408, 32424, 32429]


Total empty values in all listings.

In [65]:
len([val for val in job_locations if val is None])

2290

Validating some cases for `None` values in the list.

In [66]:
print(all_job_lists[32429])

ID: 93906
procedures: A complete application form should consist of a
letter of motivation and a full CV. Applications can be submitted by
e-mail to: mlnpharm@... . Please indicate the position applied
for in the subject line. Incomplete or late applications will not be
considered. Only short listed applicants will be contacted.
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 17 December 2007
TITLES: Head of Regional Sales Group, Corporate Sales Unit
START DATE: 15 February 2012
qualifications:
 - Higher education with major in marketing (MBA preferred);
- Experience in marketing/ sales is preferable;
- Excellent analytical skills;
- Ability to work in a team;
- Ability to work under pressure;
- Strong organizational and interpersonal skills;
- Good negotiation skills;
- Good computer skills;
- Fluent knowledge of English, Russian and Armenian la

### 6.5 Application Deadline:

Getting a list of possible feature names for `Application Deadline`

In [67]:
find_feature_names(["dl", "deadline", "dead"], possible_features)

the deadline
DEAD_LINE
Initial deadline
APPLICATION_DL
deadline at
deadline to
APPLICATION_DEADL
DEADLINES
Incident Handling
stated deadlines
deadline
kindly visit
kindly call


From the above list, the possible feature names for `Application Deadline` are:
1. DEAD_LINE
2. APPLICATION_DEADL
3. APPLICATION_DL
4. deadline
5. DEADLINES

__Analysis:__

1. Single Line Feature. Therefore not many checks need to be made.
2. Every variation of the feature name includes the string `DL` or `DEAD` (in both alphabetical cases), which can be used as a part of the __re__.

Based on these observations, a regular expression for `Application Deadline` would be:

In [68]:
deadline_re = r"(?:(?:(?:[dD][eE][aA][dD])|(?:DL)).*: )(.*)"

__Regular Expression used:__ `(?:(?:(?:[dD][eE][aA][dD])|(?:DL)).*: )(.*)` <br/>
__Explanation:__ 

- Match strings containing `DEAD` in both alphabetical cases or `DL` as a part of the string, followed by all characters up to the last occurrence of `: ` on the same line.
- Matching all characters thereafter within the same line using the expression `(.*)`, to capture the `Application Deadlines` for the listing.

In [69]:
job_deadlines = [
    re.findall(deadline_re, val)[0]
    if len(re.findall(deadline_re, val)) > 0
    else
        None
    for val in all_job_lists
]

Total empty values in all listings.

In [70]:
len([val for val in job_deadlines if val is None])

2242

Validating some cases for `None` values in the list.

In [71]:
missing_idx = [index for index, val in enumerate(job_deadlines) if val is None]
print(missing_idx[2:5])
print(missing_idx[1000:1005])
print(missing_idx[-3:])

[42, 51, 74]
[14688, 14697, 14703, 14739, 14748]
[32447, 32450, 32463]


In [72]:
print(all_job_lists[32463])

ID: 62067
JOB_PROC: If you are interested in this position, please
send your cover letter and CV to: job@....
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
JOB_T: User Interface/ Web Designer
START DATE/
start_date: 19 December 2008
_LOC: Yerevan, Armenia
REQ_QUALS:
 - University degree in Management, Social Sciences, Technical Sciences or
Economics; higher degree is a plus;
- At least 5 years of professional experience in the field of energy
efficiency and solid knowledge and understanding of the local residential
buildings sector and industry;
- Experience in housing finance and banking is an asset;
- Experience in project cycle management with exposure to financial
management;
- Experience in coordinating stakeholders, consultants and other parties
from designing to commissioning phase of the project;
- Experience with facilitation of direct beneficiar

### 6.6 Start Dates

Getting a list of possible feature names for `Job Start Date`

In [73]:
find_feature_names(["start", "date", "dt"], possible_features)

candidate must
getting started
Starting date
START_DA
START DATE
DATE_START
Starting salary
DATES
candidate to
candidates should
the candidate
Candidates must
Starting Salary
be elucidated
mandate to
candidate should
candidate will
start_date


From the above list, the possible feature names for `Job Start Date` are:
1. DATE_START
2. START DATE
3. START_DA
4. DATES
5. start_date

__Analysis:__

1. Single Line Feature, however some listings have values spanning multiple lines. 
2. The variations of the feature name include the string `_DA` or `DATE` in either case (with an optional trailing `S`), which can be used as a part of the __re__.

Based on these observations, a regular expression for `Start Date` would be:

__Assumption:__ `Start Date` would be captured only if it is a Date, else kept as `None`

In [74]:
job_date_re = r"(?:(?:(?:_DA)|(?:[dD][aA][tT][eE]S?)).*?: )(?=[0-9])(.*)"

The expression includes an additional lookahead, `(?=[0-9])`, since a date value must start with a digit.
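
The effect of the lookahead can be seen on a listing in which the word `date` also appears in free text. In this sketch the `listing` text is hypothetical and `no_lookahead_re` is a comparison pattern introduced only for illustration.

```python
import re

job_date_re = r"(?:(?:(?:_DA)|(?:[dD][aA][tT][eE]S?)).*?: )(?=[0-9])(.*)"

# Same pattern without the lookahead, used here only for comparison.
no_lookahead_re = r"(?:(?:(?:_DA)|(?:[dD][aA][tT][eE]S?)).*?: )(.*)"

# Hypothetical listing: free text mentions "date" before the real feature.
listing = (
    "Starting date: to be discussed with successful candidates.\n"
    "DATE_START: 02 July 2007\n"
    "LOCATION: Yerevan, Armenia\n"
)

with_lookahead = re.findall(job_date_re, listing)
without_lookahead = re.findall(no_lookahead_re, listing)

print(with_lookahead)        # only the value that starts with a digit
print(without_lookahead[0])  # the free-text phrase is captured first
```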

In [75]:
job_dates = [
    re.findall(job_date_re, val)[0]
    if len(re.findall(job_date_re, val)) > 0
    else
        None
    for val in all_job_lists
]

In [76]:
len([val for val in job_dates if val is None])

2326

In [77]:
missing_idx = [index for index, val in enumerate(job_dates) if val is None]
print(missing_idx[2:5])
print(missing_idx[1000:1005])
print(missing_idx[-3:])

[39, 60, 85]
[14732, 14733, 14747, 14749, 14753]
[32458, 32459, 32465]


In [78]:
job_id_int.index(37202)

13270

In [79]:
job_dates[13270]

'02 July 2007'

In [80]:
print(all_job_lists[13270])

ID: 37202
procedures: All interested candidates are requested to
submit the cover letter and CV to: office@..., with obligatory copy
to anahit@.... Please write in the subject: Application for (name
position), Armenia  and your full name.
It is strongly recommended that all candidates visit website of Heifer
International www.heifer.org and www.hpi.am prior to applying for the
position.  
Short-list candidates will be invited for interviews in the Heifer
Armenia office between May 14 and 25. Time for interviews will be
announced while contacting the short-list candidates. 
Starting date: to be discussed with successful candidates, but no later
than July 1, 2007.
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
APPLICATION_DL: 10 August 2013
JOB_T: Radio Engineer
DATE_START: 02 July 2007
LOCATION: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 Corporate Competenc

### 6.7. Job Responsibilities:

`Job Responsibilities` is handled next because it is the last feature in each job listing, so the closing `---` sequence can be used to identify where its value ends.

Getting a list of possible feature names for `Job Responsibilities`

In [81]:
find_feature_names(["resp"], possible_features)

Assistants Responsibilities
Responsibilities include
Response Management
Key Responsibilities
JOB_RESPS
Operational Responsibilities
Cashier Responsibilities
Budget responsibilities
responsibilities include
responsibilities are
Doctor Responsibilities
of responsibility
wholly responsible
Emergency Response
Responsible for
Secondary Responsibilities
HR Responsibilities
Key responsibilities
RESP
Other Responsibilities
technical responsibilities
General Responsibilities
Other responsibilities
responsible for
s responsibilities
following responsibilities
Major Responsibilities
response operations
Administrative Responsibilities
Primary Responsibilities
responsibilities
related responsibilities
corresponding fields
main responsibilities
Assistant Responsibilities
and Responsibilities
Specific responsibilities
General responsibilities
Revenue responsibility
RESPONSIBILITY
responsibility to
Specific Responsibilities
corresponding policies
respective changes
JOB RESPONSIBILITIES


From the above list the variations for `Job Responsibilities` possible feature names are:
1. RESP
2. JOB RESPONSIBILITIES
3. JOB_RESPS
4. RESPONSIBILITY

__Analysis:__

1. Multiline feature, and the last feature present in each listing. 
2. Feature-name variations containing the string `RESP` can be matched as part of the __re__.
3. There is one exception: the lowercase feature name `responsibilities` must be matched separately, and only when it is not preceded by a space (this can be achieved using a negative lookbehind). 

Based on these observations, a regular expression for `Job Responsibility` would be:

In [82]:
job_responsibility_re = r"(?:(?:(?:(?:RESP).*)|(?:(?<![ ])responsibilities)):)((?:.*\n)*?)(?=---+)"

__Regular Expression used:__ `(?:(?:(?:(?:RESP).*)|(?:(?<![ ])responsibilities)):)((?:.*\n)*?)(?=---+)` <br/>
__Explanation:__ 

- Match feature names containing `RESP`, followed by all characters up to the last occurrence of `:` on that line.
- The feature name `responsibilities` is also matched, guarded by a negative lookbehind that rejects a preceding space. This prevents capturing sub-headings such as `General responsibilities`.
- Capture all the lines that follow, lazily, using the expression `((?:.*\n)*?)`.
- The end of each feature value is found using the `---` sequence, since this is the last feature.
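The effect of the negative lookbehind can be checked with a small sketch. The two toy listings below are hand-written assumptions that only mimic the layout of the dataset.

```python
import re

# Pattern copied from the cell above.
job_responsibility_re = r"(?:(?:(?:(?:RESP).*)|(?:(?<![ ])responsibilities)):)((?:.*\n)*?)(?=---+)"

separator = "-" * 30

# Toy listing where `responsibilities` is the feature name itself.
top_level = "responsibilities:\n - Attract clients;\n- Ensure payments.\n" + separator + "\n"

# Toy listing where the word is only the second word of a sub-heading.
sub_heading = "General responsibilities:\n- Attract clients;\n" + separator + "\n"

top_level_matches = re.findall(job_responsibility_re, top_level)
sub_heading_matches = re.findall(job_responsibility_re, sub_heading)

print(top_level_matches)    # one captured block of lines
print(sub_heading_matches)  # empty list, the lookbehind rejects the space
```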

In [83]:
job_responsibility = [
    re.findall(job_responsibility_re, val)[0]
    if len(re.findall(job_responsibility_re, val)) > 0
    else
        None
    for val in all_job_lists
]

In [84]:
len([val for val in job_responsibility if val is None])

2324

In [85]:
print([index for index, val in enumerate(job_responsibility) if val == None][2:5])
print([index for index, val in enumerate(job_responsibility) if val == None][1000:1005])
print([index for index, val in enumerate(job_responsibility) if val == None][-3:])

[11, 14, 88]
[13567, 13568, 13581, 13644, 13668]
[32406, 32430, 32431]


In [86]:
print(all_job_lists[32431])

ID: 98709
PROCEDURE: All interested and qualified candidates are
encouraged to fill in the last updated version of HSBC Application Form
attached to this announcement and email it to: vacancy.armenia@... .
The old versions of application forms will not be reviewed. Only
short-listed candidates will be invited for interviews. Please put on the
subject line of the e-mail ""Branch Representative"".
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 15 April 2007, 5:00 p.m.
_TTL: Head of Internal Audit
start_date: 25 January 2008
LOCATION: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - University degree in Law; 
- At leas 1 year of work experience in the Central Bank or 2 years of
work experience elsewhere;
- Advanced knowledge of corporate, banking, labour, civil, civil
procedures, constitutional, procurement, public tenders, currency, tax,
international

Each responsibility has multiple bullet points, so each bullet point is separated and stored in a multidimensional list.

In [87]:
job_responsibility_ind = [
    re.split(r"- ", val)
    if val is not None
    else
        None
    for val in job_responsibility
]

Splitting of the individual bullet points has been done using the [re.split()](https://docs.python.org/3.6/library/re.html#re.split) (Python Software Foundation, 2018) method.

In [88]:
job_responsibility_ind = [
    [
        v.strip()
        if val is not None
        else
            None
        for v in val
        if len(v.strip()) > 0
    ]
    if val is not None
    else
        []
    for val in job_responsibility_ind
]

In [89]:
print(job_responsibility[100])


 - Attract potential clients;
- Evaluate the credit risk of clients before the credit committee;
- Prepare and present required reports and documentation to the credit
committee;
- Inform clients and guarantors on their rights and obligations;
- Monitor client businesses to ensure the continuance ability to repay;
- Ensure on time and correct payments;
- Participate in classroom and on the job trainings.



In [90]:
print(job_responsibility_ind[100])

['Attract potential clients;', 'Evaluate the credit risk of clients before the credit committee;', 'Prepare and present required reports and documentation to the credit\ncommittee;', 'Inform clients and guarantors on their rights and obligations;', 'Monitor client businesses to ensure the continuance ability to repay;', 'Ensure on time and correct payments;', 'Participate in classroom and on the job trainings.']
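The two cleaning steps above can be condensed into one helper, sketched below. The name `split_bullets` and the sample text are assumptions for illustration. The sketch also shows a line-anchored pattern as an alternative to the notebook's `- ` pattern, since a plain `- ` also splits inside phrases such as `pre- and post-sale`.

```python
import re


def split_bullets(text, pattern=r"- ", flags=0):
    """Split a feature value into stripped, non-empty bullet points."""
    if text is None:
        return []
    parts = re.split(pattern, text, flags=flags)
    return [part.strip() for part in parts if part.strip()]


# Hand-written value (assumed), containing a hyphen followed by a space mid-sentence.
value = "\n - Attract potential clients;\n- Provide pre- and post-sale support;\n"

# Same pattern as in the notebook cells above.
notebook_style = split_bullets(value)

# Alternative (not the notebook's pattern): only split on "- " at the start of a line.
line_anchored = split_bullets(value, pattern=r"^\s*- ", flags=re.MULTILINE)

print(notebook_style)
print(line_anchored)
```

Whether the line-anchored variant is preferable depends on how bullets are laid out in the full dataset, which this sketch does not verify.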


### 6.8 Job Description

Getting a list of possible feature names for `Job Description`

In [91]:
find_feature_names(["desc"], possible_features)

_description
description languages
JOB DESCRIPTION
DESCRIPTION
job_desc
descriptions to
JOB_DESC


From the above list the variations for `Job Description` possible feature names are:
1. JOB DESCRIPTION
2. job_desc
3. \_description
4. JOB_DESC
5. DESCRIPTION

__Analysis:__

1. Multiline feature, and the second-to-last feature in each listing, so the `Job Responsibility` feature name can mark where it ends. However, since `Job Responsibility` has __`2324`__ null entries, the ending identifier must also accept the `---` sequence. 
2. Feature-name variations containing the string `DESC` can be matched as part of the __re__.
3. Matching on `DESC` variations alone is not sufficient, since the token `description languages` might generate garbage values.
To overcome this, `DESC` must either be preceded by the keyword `JOB ` or not be preceded by a space (which can be achieved using a negative lookbehind).

Based on these observations, a regular expression for `Job Description` would be:

In [92]:
job_description_re = r"(?:(?:(?:(?:JOB )|(?<![ ]))[dD][eE][sS][cC]).*:)((?:.*\n)*?)(?:(?:.*RESP.*:)|(?:[ ]responsibilities:)|(?:---+))"

__Regular Expression used:__ 

`(?:(?:(?:(?:JOB )|(?<![ ]))[dD][eE][sS][cC]).*:)((?:.*\n)*?)(?:(?:.*RESP.*:)|(?:[ ]responsibilities:)|(?:---+))` <br/>
__Explanation:__ 

- Match feature names containing `DESC` in either case, followed by all characters up to the last occurrence of `:` on that line.
- Following the initial analysis, the keyword `JOB ` is tried first before any variation of `DESC`; if it is absent, the negative lookbehind checks that the sequence is not preceded by a space.
- Capture all the lines that follow, lazily, using the expression `((?:.*\n)*?)`.
- The end of each feature value is found using the `Job Responsibilities` feature-name patterns or the `---` sequence, as described in the initial analysis.
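A short sketch of the pattern on two hand-written snippets (assumed, not taken from the dataset) shows both the end-of-value detection and the rejection of `description languages`.

```python
import re

# Pattern copied from the cell above.
job_description_re = r"(?:(?:(?:(?:JOB )|(?<![ ]))[dD][eE][sS][cC]).*:)((?:.*\n)*?)(?:(?:.*RESP.*:)|(?:[ ]responsibilities:)|(?:---+))"

separator = "-" * 30

# Toy listing: the description ends where the responsibilities heading starts.
with_heading = (
    "JOB_DESC: Coordinate customs clearance.\n"
    "File the paperwork.\n"
    "RESPONSIBILITY:\n"
    " - Calculate budget for customs fees;\n"
    + separator + "\n"
)

# Toy listing: `description` appears only as an ordinary word after a space.
inline_mention = "Knowledge of description languages: HTML\n" + separator + "\n"

with_heading_matches = re.findall(job_description_re, with_heading)
inline_mention_matches = re.findall(job_description_re, inline_mention)

print(with_heading_matches)
print(inline_mention_matches)
```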

In [93]:
job_description = [
    re.findall(job_description_re, val)[0]
    if len(re.findall(job_description_re, val)) > 0
    else
        None
    for val in all_job_lists
]

In [94]:
print([index for index, val in enumerate(job_description) if val == None][2:5])
print([index for index, val in enumerate(job_description) if val == None][1000:1005])
print([index for index, val in enumerate(job_description) if val == None][-3:])

[38, 55, 59]
[14616, 14618, 14646, 14647, 14685]
[32386, 32419, 32431]


In [95]:
print(all_job_lists[32431])

ID: 98709
PROCEDURE: All interested and qualified candidates are
encouraged to fill in the last updated version of HSBC Application Form
attached to this announcement and email it to: vacancy.armenia@... .
The old versions of application forms will not be reviewed. Only
short-listed candidates will be invited for interviews. Please put on the
subject line of the e-mail ""Branch Representative"".
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 15 April 2007, 5:00 p.m.
_TTL: Head of Internal Audit
start_date: 25 January 2008
LOCATION: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - University degree in Law; 
- At leas 1 year of work experience in the Central Bank or 2 years of
work experience elsewhere;
- Advanced knowledge of corporate, banking, labour, civil, civil
procedures, constitutional, procurement, public tenders, currency, tax,
international

In [96]:
len([val for val in job_description if val is None])

2252

From the sample XML and JSON files provided, it can be inferred that `Job Description` has sub-elements within the main element, so each description is split on `. ` and stored in a multidimensional list.

In [97]:
job_description_ind = [
    re.split(r"\. ", val)
    if val is not None
    else
        None
    for val in job_description
]

In [98]:
job_description_ind = [
    [
        v.strip()
        if val is not None
        else
            None
        for v in val
        if len(v.strip()) > 0
    ]
    if val is not None
    else
        []
    for val in job_description_ind
]

In [99]:
job_description_ind[7750]

['Financial Sector Deepening Project (FSDP) will be\nassisting Ministry of Finance and Economy (MFE) in the official\ntranslation and national adoption of International Financial Reporting\nStandards (IFRS) in accordance with the requirements and procedure of\nInternational Accounting Standards Board (IASB) and Review Committee\n(RC) Experts',
 'In this regard, FSDP announces 3-4 job vacancies for\nIFRS/IAS Translation Review Committee Experts, who will be responsible\nfor review, editing, and revision of the Armenian translation of the\ntranslated International Financial Reporting Standards (IFRS) in\naccordance with the IASB/IASCF requirements and procedures.\nThe Review Committee Experts must be approved by the IASC Foundation.']
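In the output above, the second element still contains two sentences, because the full stop between them is followed by a newline rather than a space. The sketch below reproduces this on a hand-written string (an assumption) and compares it with a whitespace-tolerant pattern, which is shown only as a possible variant and is not what the notebook uses.

```python
import re

# Hand-written text (assumed): the second full stop is followed by a newline.
text = "FSDP announces 3-4 job vacancies. Experts review the translation.\nExperts must be approved."

# Pattern used in the notebook: full stop followed by a single space.
notebook_split = re.split(r"\. ", text)

# Variant (assumption): full stop followed by any whitespace, including newlines.
whitespace_split = re.split(r"\.\s+", text)

print(len(notebook_split), notebook_split)
print(len(whitespace_split), whitespace_split)
```

The variant would also split after abbreviations that end in a full stop, so it is not necessarily an improvement for this dataset.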

### 6.9 Job Salary

Getting a list of possible feature names for `Job Salary`

In [100]:
find_feature_names(["sal", "remun", "job_sal", "job_remun"], possible_features)

net salary
sales process
Initial salary
REMUNERATION
sales statistics
Proactive Sales
equipment sales
property sales
remuneration
Proposal
Sallary
Annual salary
Starting salary
sales activities
Sales
Salary
Starting Salary
Sales activities
JOB_SAL
Sales skills
salary
Fixed salary
competitive salary
Salary Range
SALARY
Aftersales businesses
Sales Director
salary to
salaries to


From the above list the variations for `Job Salary` possible feature names are:
1. JOB_SAL
2. SALARY
3. REMUNERATION
4. salary
5. remuneration

__Analysis:__

1. Multiline feature, with a short body.
2. Feature-name variations containing the strings `SALA` and `REMUNERATION` can be matched as part of the __re__.
3. `SALA` is used rather than `SAL` to eliminate tokens having `Sales` as a substring.
4. However, in doing so, an additional matching group must be added specifically for `JOB_SAL`.
5. A positive lookahead matches the first occurrence of `:` or the sequence `---`, which indicate the next feature name or the end of the listing respectively.

Based on these observations, a regular expression for `Job Salary` would be:

In [101]:
job_salary_re = r"(?:(?:(?:[rR][eE][mM][uU][nN])|(?:JOB_SAL)|(?:(?<![ ])[sS][aA][lL][aA])).*:)((?:.*\n)*?)(?=(?:(?:.*:)|(?:---+)))"

__Regular Expression used:__ 

`(?:(?:(?:[rR][eE][mM][uU][nN])|(?:JOB_SAL)|(?:(?<![ ])[sS][aA][lL][aA])).*:)((?:.*\n)*?)(?=(?:(?:.*:)|(?:---+)))` <br/>
__Explanation:__ 

- Match feature names containing `SALA` or `REMUN` in either case, followed by all characters up to the last occurrence of `:` on that line.
- Following the initial analysis, an additional matching group has been added for `JOB_SAL`.
- A negative lookbehind before `SALA` eliminates feature names where `salary` or any other variation is used as the second word.
- Capture all the lines that follow, lazily, using the expression `((?:.*\n)*?)`.
- The end of each feature value is found using a positive lookahead that matches the first line containing `:` or the sequence `---+`.
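The behaviour described above can be sketched on two hand-written snippets (assumptions that only imitate the dataset layout): one where `SALARY` is the feature name and the value wraps onto a second line, and one where `salary` is merely the second word of a phrase.

```python
import re

# Pattern copied from the cell above.
job_salary_re = r"(?:(?:(?:[rR][eE][mM][uU][nN])|(?:JOB_SAL)|(?:(?<![ ])[sS][aA][lL][aA])).*:)((?:.*\n)*?)(?=(?:(?:.*:)|(?:---+)))"

# Toy snippet: the value runs until the next line that contains a colon.
feature_line = "SALARY: Up to 600,000 AMD (based on education and\nexperience)\nJOB_DESC: N/A\n"

# Toy snippet: `salary` is preceded by a space, so the lookbehind rejects it.
second_word = "Starting salary: 500 USD\nJOB_DESC: N/A\n"

feature_line_matches = re.findall(job_salary_re, feature_line)
second_word_matches = re.findall(job_salary_re, second_word)

print(feature_line_matches)
print(second_word_matches)
```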

In [102]:
job_salary = [
    re.findall(job_salary_re, val)[0]
    if len(re.findall(job_salary_re, val)) > 0
    else
        None
    for val in all_job_lists
]

Finding the indices for `None` values to perform manual validation.

In [103]:
print([index for index, val in enumerate(job_salary) if val == None][2:5])
print([index for index, val in enumerate(job_salary) if val == None][1000:1005])
print([index for index, val in enumerate(job_salary) if val == None][-3:])

[40, 64, 73]
[13834, 13857, 13863, 13881, 13883]
[32417, 32426, 32451]


In [104]:
len([val for val in job_salary if val is None])

2356

In [105]:
print(all_job_lists[32451])

ID: 28048
PROCEDURES: Please send your CV in Russian or Armenian
languages with a recent photo to: ngyulzadyan@... .
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 22 September 2005
JOB_T: Software Developer, Software Development Unit, IT and Automation
Division
START DATE/
DATES: 09 March 2007
_LOCS: Yerevan, Armenia
QUALIFS:
 - Higher Humanitarian or Economic education;
- At least 2 years of work experience in the relevant field;
- Excellent knowledge of business ethic and behavior;
- Ability to work under pressure;
- High sense of responsibility and diligence;
- Ability to work with confidential information and top secret
documentation; 
- Fluency in Armenian and Russian languages; good knowledge of English
language;
- Good knowledge of MS Office tools.
ABOUT:
 
JOB DESCRIPTION: We are looking for qualified Web Developers.
responsibilities:
 

### 6.10 About Company

In [106]:
find_feature_names(["about", "abt", "company", "info"], possible_features)

following information
company procedures
information call
the Company
ABOUT
company website
Contact Info
company at
About Trainers
about
COMPANYS_INFO
Information Technology
information management
Company at
company interests
Company on
technical information
information systems
the company
Information
information about
Information management
Contact Information
company address
information
more information
_info
Additional Information
company property
company employees
company staff
Information Ordering
contact information
about_company
with information
company visit
Information Flow
management company
information visit
ABOUT COMPANY


From the above list the variations for `Company Information` possible feature names are:
1. ABOUT COMPANY
2. ABOUT
3. \_info
4. about_company
5. COMPANYS_INFO

Feature names have been identified, but there seems to be no unique way to find the ending point for `Company Information`, short of using every remaining key that can follow `Company Information` to identify the end.

However, that seems unrealistic. Another approach is to create a copy of the dataset and remove, from the end of each listing, the features (and their values) that have already been captured. This leaves `Company Information` at the end, where it can be easily identified.

In [107]:
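def remove_end(data=None, feature=None):
    # Nothing to process if no dataset is supplied.
    if data is None:
        return
    else:
        copy_data = []
        
        for listing in data:
            index = -1
            new_listing = listing
            # Find the first feature name (followed by ":") present in the listing.
            for name in feature:
                index = listing.find(name+":")
                if index != -1:
                    break
            
            # Cut the listing at that feature and re-attach its last 30 characters.
            if index != -1:
                new_listing = listing[:index] + listing[-30:]
            copy_data.append(new_listing)
    return copy_data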

__Method: $remove\_end()$__

- Parameters: 
    - data[list]: Dataset for each listing.
    - feature[list]: List of all possible feature names which are to be matched to remove from each listing.

- Returns: 
    - copy_data[list]: Dataset listings with the removed feature.
 
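- For each listing, finds the first of the given feature names that occurs (followed by `:`) and removes everything from that point on, re-attaching the last 30 characters of the listing (where the closing `---` separator sits).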
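To make the trimming concrete, here is a minimal sketch on a toy listing. `remove_end_sketch` is a condensed restatement of the function above under a hypothetical name, and the toy listing and its field values are assumptions.

```python
def remove_end_sketch(data, feature):
    """Condensed restatement of remove_end() for illustration."""
    trimmed = []
    for listing in data:
        index = -1
        # Stop at the first candidate name (followed by ":") found in the listing.
        for name in feature:
            index = listing.find(name + ":")
            if index != -1:
                break
        if index != -1:
            # Drop everything from the feature onwards, keep the last 30 characters.
            listing = listing[:index] + listing[-30:]
        trimmed.append(listing)
    return trimmed


separator = "-" * 30

# Toy listing (assumed) that ends with a 30-character separator.
toy_listing = (
    "TITLE: Accountant\n"
    "ABOUT: A small firm.\n"
    "SALARY: 500 USD\n"
    "JOB_DESC: Bookkeeping.\n"
    "RESPONSIBILITY:\n"
    "- Prepare reports;\n"
    + separator
)

toy_data = [toy_listing]
# Remove responsibilities, then description, then salary, in that order.
for names in (["RESPONSIBILITY"], ["JOB_DESC"], ["SALARY"]):
    toy_data = remove_end_sketch(toy_data, names)

print(toy_data[0])
```

After the three passes only the title and the company information remain in front of the separator, which is the state the `Company Information` pattern relies on.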

Removing `Job Responsibility`, `Job Description` and `Job Salary` from each listing in the same order.

In [108]:
responsibility_feature = [
    "RESP",
    "JOB RESPONSIBILITIES",
    "JOB_RESPS",
    "RESPONSIBILITY",
    "responsibilities"
]

In [109]:
description_feature = [
    "JOB DESCRIPTION",
    "job_desc",
    "_description",
    "JOB_DESC",
    "DESCRIPTION"
]

In [110]:
salary_feature = [
    "JOB_SAL",
    "SALARY",
    "REMUNERATION",
    "salary",
    "remuneration"
]

In [111]:
features = [
    responsibility_feature,
    description_feature,
    salary_feature
]

In [112]:
new_data = all_job_lists.copy()
for feature in features:
    new_data = remove_end(new_data, feature=feature)

__Analysis:__

1. Multiline Feature
2. Variations of the feature name that includes the string `_info`, and `about` along with the different cases can be used as a part of the __re__.
3. Since `Company Information` is the last feature for the new data, we can match the end with the sequence of `---`.

Based on these observations, a regular expression for `Company Information` would be:

In [113]:
job_about_re = r"(?:(?:(?:[aA][bB][oO][uU][tT])|(?:_[iI][nN][fF][oO])).*:)((?:.*\n)*?)(?=---+)"

In [114]:
job_about = [
    re.findall(job_about_re, val)[0]
    if len(re.findall(job_about_re, val)) > 0
    else
        None
    for val in new_data
]

In [115]:
print([index for index, val in enumerate(job_about) if val == None][2:5])
print([index for index, val in enumerate(job_about) if val == None][1000:1005])
print([index for index, val in enumerate(job_about) if val == None][-3:])

[27, 43, 44]
[14139, 14148, 14172, 14173, 14178]
[32424, 32449, 32472]


In [116]:
len([val for val in job_about if val is None])

2242

In [117]:
print(all_job_lists[14421])

ID: 59399
PROCEDURE: Interested candidates for this position should
submit the following:
A.  Application for Federal Employment (SF-171 or OF-612);  or
B.  A current resume
C.  Documentation (e.g., essays, certificates, awards, copies of degrees
earned) that address the minimum requirements of the position as listed
above.
SUBMIT APPLICATION TO
Human Resources Office 
Attention: Hasmik Melkonyan 
18 Baghramian Ave, Yerevan 375019, Armenia
POINT OF CONTACT
Name: Gohar Sargsyan 
Telephone: (374 1) 52-46-61
DEAD_LINE: 16 March 2014
START DATE: 31 July 2013
LOCATION: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - Higher education, preferably in Humanities; students may apply as
well;
- Excellent communication skills, enthusiastic and proactive
personality;
- Excellent skills in Armenian language (knowledge of dialects as well as
distinct handwriting are preferable);
- Ability to work under pressure and within strict time frames;
- Ability to travel within Armenia for the scheduled dates;
- 

### 6.11 Job Qualifications

Finding possible feature names for job qualifications

In [118]:
find_feature_names(["qual"], possible_features)

Personal qualities
qualifications
QUALIFICATION
Desired Qualifications
Preferred qualifications
Other qualifications
Desired qualifications
qualified
Quality Control
Preferred qualification
Specific Qualifications
Quality control
REQ_QUALS
Quality Assurance
qualifications to
Qualifications
following qualifications
Main Qualifications
qualifications are
Professional qualities
Other Qualifications
REQUIRED QUALIFICATIONS
Academic qualifications
Desirable Qualifications
and Qualifications
Minimum Qualifications
QUALIFS
Qualification requirements
and qualifications
Preferred Qualifications
are qualified
Quality Improvement
Main qualifications
Academic Qualifications
Preferred Qualification
Quality Management
qualifications include
quality assurance
Specific qualifications
Recruitment qualifications
water quality
qualification to
Personal Qualities
Additional Qualifications
minimum qualifications
quality management


From the above list the variations for `Required Qualifications` possible feature names are:

1. REQUIRED QUALIFICATIONS
2. QUALIFS
3. REQ_QUALS
4. qualifications
5. QUALIFICATION
6. Qualifications

Using the same approach used for `Company Information`, removing this feature from each listing.

In [119]:
about_feature = [
    "ABOUT COMPANY",
    "ABOUT",
    "_info",
    "about_company",
    "COMPANYS_INFO"
]

In [120]:
new_data = remove_end(new_data, feature=about_feature)

In [121]:
print(new_data[0])

ID: 29711
JOB_PROC: Please submit your resumes to:http://tbe.taleo.net/NA6/ats/careers/requisition.jsp?org=QUESTRADE&cws=1&rid=223
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 14 October 2014, 18:00
JOB_T: Head of Corporate Customers Relationship Management Division
OPEN TO/
DATE_START: 12 July 2005
JOB_LOC: Yerevan, Armenia
REQUIRED QUALIFICATIONS:
 - Higher education;
- At least 2 years of work experience in programming.
REMUNERATION/
------------------------------


__Analysis:__

1. Multiline Feature
2. Variations of the feature name that includes the string `qual`, along with the different cases can be used as a part of the __re__.
3. Since `Qualifications` is the last feature for the new data, we can match the end with the sequence of `---`.

Based on these observations, a regular expression for `Qualifications` would be:

In [122]:
job_qualification_re = r"(?:(?:(?:REQ_QUALS)|(?:[qQ][uU][aA][lL][iI][fF])).*:)((?:.*\n)*?)(?=---+)"

__Regular Expression used:__ 

`(?:(?:(?:REQ_QUALS)|(?:[qQ][uU][aA][lL][iI][fF])).*:)((?:.*\n)*?)(?=---+)` <br/>
__Explanation:__ 

- Match feature names containing `qualif` in either case, or `REQ_QUALS`, followed by all characters up to the last occurrence of `:` on that line.
- Capture all the lines that follow, lazily, using the expression `((?:.*\n)*?)`.
- The end of each feature value is found using a positive lookahead for the first occurrence of the sequence `---+`, since it is now the last feature in the new data.
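As a sketch, the pattern can be applied to a snippet modelled on the trimmed listing printed earlier in this section (`new_data[0]`). The snippet is retyped by hand for illustration. It shows that a leftover line without a colon, such as `REMUNERATION/`, ends up inside the captured value.

```python
import re

# Pattern copied from the cell above.
job_qualification_re = r"(?:(?:(?:REQ_QUALS)|(?:[qQ][uU][aA][lL][iI][fF])).*:)((?:.*\n)*?)(?=---+)"

# Snippet modelled on the trimmed listing shown earlier in this section.
trimmed_listing = (
    "JOB_LOC: Yerevan, Armenia\n"
    "REQUIRED QUALIFICATIONS:\n"
    " - Higher education;\n"
    "- At least 2 years of work experience in programming.\n"
    "REMUNERATION/\n"
    + "-" * 30 + "\n"
)

qualification_matches = re.findall(job_qualification_re, trimmed_listing)
print(qualification_matches[0])
```

If such leftovers are unwanted, they would need to be stripped in a separate step; the notebook as shown does not do this.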

In [123]:
job_qualification = [
    re.findall(job_qualification_re, val)[0]
    if len(re.findall(job_qualification_re, val)) > 0
    else
        None
    for val in new_data
]

Each qualification has multiple bullet points, so each bullet point is separated and stored in a multidimensional list.

In [124]:
job_qualification_ind = [
    re.split(r"- ", val)
    if val is not None
    else
        None
    for val in job_qualification
]

In [125]:
job_qualification_ind = [
    [
        v.strip()
        if val is not None
        else
            None
        for v in val
        if len(v.strip()) > 0
    ]
    if val is not None
    else
        []
    for val in job_qualification_ind
]

In [126]:
print([index for index, val in enumerate(job_qualification) if val == None][2:5])
print([index for index, val in enumerate(job_qualification) if val == None][1000:1005])
print([index for index, val in enumerate(job_qualification) if val == None][-3:])

[24, 36, 41]
[14067, 14072, 14074, 14111, 14126]
[32400, 32414, 32442]


In [127]:
print(all_job_lists[32442])

ID: 96655
JOB_PROC: If you are interested in this position, please
submit your application for the position of ""Consultant. MSE Finance in
Armenia"" online at: http://www.bfconsulting.org/submit_cv.php.
Please include your cover letter and information about your work
experience and education.
Please clearly mention in your application letter that you learned of
this job opportunity through Career Center and mention the URL of its
website - www.careercenter.am, Thanks.
DEAD_LINE: 11 May 2007
TITLES: Chief Accountant
DATES: 14 April 2014
LOCATION: Yerevan, Armenia
ABOUT COMPANY:
 Arge Business LLC is an Official Distributor of Proctor &
Gamble in
SALARY: Up to 400,000-600,000 AMD (based on education and
experience)
JOB_DESC: N/A
RESPONSIBILITY:
 - Coordinate documentation preparation and requirements for finalization
of customs clearance process in compliance with local regulations;
- Calculate budget for customs fees;
- Cooperate with internal and external entities in order to secure a

In [128]:
len([val for val in job_qualification if val is None])

2322

In [129]:
for val in job_qualification_ind[39]:
    print(val)
    print()

Master's degree in a relevant field; MBA is preferred;

At least 3 years of work experience in a relevant field;

Ability to work under pressure; stress tolerance;

Outstanding interpersonal, negotiation and conflict resolution skills;

Critical thinking skills and attention to details;

Excellent knowledge of the Armenian, Russian and English languages;

Advanced user of MS Office and accounting software (1C or other
software).


REMUNERATION/



## 7. Converting to XML:

Creating listings tag

Finding the total job listings provided.

In [130]:
total_listings = len(all_job_lists)

In [131]:
print(total_listings)

32474


In [132]:
def create_id(index, fh):
    fh.write('<listing id="{}">'.format(job_id_int[index]))
    fh.write("\n")
    return 

def close_id(fh):
    fh.write('</listing>')
    fh.write("\n")
    return

__Method: $create\_id()$__

- Parameters: 
    - index[int]: The `index` for the given listing.
    - fh: File pointer for writing into the file.
    
- Write an opening tag for individual job listing for the given `index`

__Method: $close\_id()$__

- Parameters: 
    - fh: File pointer for writing into the file.
    
- Write the closing tag for individual listing into the file.

In [133]:
def create_tag(index, fh, list_name, tag_name, nested_list=None, nested_tag_name=None, nested=False):
    if nested:
        fh.write('<{}>'.format(tag_name))
        fh.write("\n")
        if list_name[index] is not None:
            for val in nested_list[index]:
                fh.write("<{}>".format(nested_tag_name))
                fh.write("\n")
                
                val = re.sub(r"&", r"&amp;", val)
                val = re.sub(r"<", r"&lt;", val)
                val = re.sub(r">", r"&gt;", val)
                
                fh.write("{}".format(val))
                fh.write("\n")
                fh.write("</{}>".format(nested_tag_name))
                fh.write("\n")
        else:
            fh.write("N/A")
            fh.write("\n")
    else:
        fh.write('<{}>'.format(tag_name))
        fh.write("\n")
        if list_name[index] is not None: 
            val = list_name[index]
            val = re.sub(r"&", r"&amp;", val)
            val = re.sub(r"<", r"&lt;", val)
            val = re.sub(r">", r"&gt;", val)
            
            fh.write("{}".format(val))
            fh.write("\n")
        else:
            fh.write("N/A")
            fh.write("\n")
    return

__Method: $create\_tag()$__

- Parameters: 
    - index[int]: The `index` for the given listing.
    - fh: File pointer for writing into the file.
    - list_name: The feature list from which the value must be pulled for the given `index`.
    - tag_name: The tag name for the given feature.
    - nested_list: In case of nested elements present, this parameter would have the contents for the nested elements.
    - nested_tag_name: The tag name for the nested elements.
    - nested: Boolean variable to check if nested elements are present or not.
    
- Creating a tag for the given feature along with the necessary data, if nested elements are present write them into the file as well, along with the closing tag.

While creating the XML file, note that the 3 symbols `&`, `<` and `>` have special meanings, and hence must be converted to `&amp;`, `&lt;` and `&gt;` respectively.

To perform this operation the [$sub()$](https://docs.python.org/3.6/library/re.html#re.sub) (Python Software Foundation, 2018) function from re has been utilized, which finds each occurrence of the character and replaces it with the substitute string.
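The order of the three substitutions matters: `&` has to be replaced first, otherwise the ampersands introduced by `&lt;` and `&gt;` would be escaped a second time. The sketch below illustrates this and compares the result with `xml.sax.saxutils.escape` from the standard library, which is mentioned only as an alternative and is not used in the notebook. The helper names and the sample string are assumptions.

```python
import re
from xml.sax.saxutils import escape


def escape_with_sub(val):
    """Same three substitutions as in create_tag(), with `&` handled first."""
    val = re.sub(r"&", r"&amp;", val)
    val = re.sub(r"<", r"&lt;", val)
    val = re.sub(r">", r"&gt;", val)
    return val


def escape_wrong_order(val):
    """Deliberately wrong order, to show the double-escaping problem."""
    val = re.sub(r"<", r"&lt;", val)
    val = re.sub(r">", r"&gt;", val)
    val = re.sub(r"&", r"&amp;", val)
    return val


sample = "R&D <lead> a > b"

print(escape_with_sub(sample))     # R&amp;D &lt;lead&gt; a &gt; b
print(escape_wrong_order(sample))  # R&amp;D &amp;lt;lead&amp;gt; a &amp;gt; b
print(escape(sample))              # same as escape_with_sub
```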

In [134]:
def close_tag(fh, tag_name):
    fh.write('</{}>'.format(tag_name))
    fh.write("\n")
    return

__Method: $close\_tag()$__

- Parameters: 
    - fh: File pointer for writing into the file.
    - tag_name: The tag name for the given feature.
    
- Creating a closing tag for the given feature.

In [135]:
listing_counter = 0
with open("output.xml", encoding="utf-8", mode="w") as fh:
    fh.write('<?xml version="1.0" encoding="UTF-8"?>')
    fh.write("\n")
    fh.write("<listings>")
    fh.write("\n")
    
    # Order of the tags being written into the file.
    # Each tuple consists of the following fields:
    # 1. Feature list
    # 2. Feature name
    # 3. Nested element list
    # 4. Nested element name
    # 5. Nested element boolean flag
    tag_list = [
        (job_titles, "title", None, None, False),
        (job_locations, "location", None, None, False),
        (job_description, "job_descriptions", job_description_ind, "description", True),
        (job_responsibility, "job_responsibilities", job_responsibility_ind, "responsibility", True),
        (job_qualification, "required_qualifications", job_qualification_ind, "qualification", True),
        (job_salary, "salary", None, None, False),
        (job_procedures, "application_procedure", None, None, False),
        (job_dates, "start_date", None, None, False),
        (job_deadlines, "application_deadline", None, None, False),
        (job_about, "about_company", None, None, False)
    ]
    while listing_counter < total_listings:
        create_id(listing_counter, fh)
        for list_name, tag_name, nested_list, nested_tag_name, nested in tag_list:
            create_tag(listing_counter, fh, list_name, tag_name, nested_list, nested_tag_name, nested)
            close_tag(fh, tag_name)
        
        close_id(fh)
        listing_counter += 1
    fh.write("</listings>")
    print("XML File has been created.")

XML File has been created.
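
One optional way to confirm that a generated file is well-formed is to parse it with the standard library's `xml.etree.ElementTree`. The sketch below is only an illustration under stated assumptions: it uses two small hand-written strings rather than `output.xml`, and the element layout and the helper name `is_well_formed` are hypothetical. It shows that an escaped `&amp;` parses back to `&`, while an unescaped `&` makes the parser fail.

```python
import xml.etree.ElementTree as ET

# Hand-written samples (hypothetical layout), not the real output file.
good = "<listings><listing><title>R&amp;D Engineer</title></listing></listings>"
bad = "<listings><listing><title>R&D Engineer</title></listing></listings>"

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The parser converts the entity reference back into the original character.
title = ET.fromstring(good).find("listing/title").text

print(is_well_formed(good), is_well_formed(bad), title)
```

For the real file, `ET.parse("output.xml")` could be used in the same way, assuming the file exists in the working directory.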


## 8. Converting to JSON

The JSON file is created by serializing the dictionary with the [$json.dumps()$](https://docs.python.org/3.6/library/json.html#json.dumps) (Python Software Foundation, 2018) function from the `json` package, which converts the dictionary structure into a JSON-formatted string that can then be written into a file.
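
As a minimal sketch of this idea, the snippet below serializes a small dictionary and reads it back. The sample values are made up for illustration, and only two of the fields are included; the nesting mirrors the `listings` / `listing` layout used in this section.

```python
import json

# Made-up sample data that follows the same nesting as the real structure.
sample = {"listings": {"listing": [{"_id": "1", "title": "Data Analyst"}]}}

# json.dumps() returns a str holding the JSON text.
as_text = json.dumps(sample)

# json.loads() parses the text back into Python objects.
restored = json.loads(as_text)

print(type(as_text).__name__, restored == sample)
```

Passing `indent=2` to `json.dumps()` would produce a more readable, multi-line file at the cost of a larger size.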

In [137]:
listing_counter = 0
job_listings = {}
with open("output.json", mode="w") as fh:
    job_listings["listings"] = {"listing":[]}
    
    # Iterating over all job listings 
    while listing_counter < total_listings:
        
        # Getting the list of nested elements for the 3 features based on
        # the assignment specifications. 
        # 1. Job Description
        # 2. Job Responsibilities
        # 3. Required Qualifications
        description = job_description_ind[listing_counter]
        responsibilities = job_responsibility_ind[listing_counter]
        qualifications = job_qualification_ind[listing_counter]
        
        # Creating the data element for individual job listing
        data = {
            "_id": str(job_id_int[listing_counter]),
            "title": job_titles[listing_counter],
            "location": job_locations[listing_counter],
            "job_description": {"description": description},
            "job_responsibilities": {"responsibility": responsibilities},
            "required_qualifications": {"qualification": qualifications},
            "salary": job_salary[listing_counter],
            "application_procedure": job_procedures[listing_counter],
            "start_date": job_dates[listing_counter],
            "application_deadline": job_deadlines[listing_counter],
            "about_company": job_about[listing_counter]
        }
        
        # Appending the data element into the dictionary
        job_listings["listings"]["listing"].append(data)
        listing_counter += 1
        
    # Serializing the dictionary into a JSON-formatted string.
    job_data = json.dumps(job_listings)
    
    # Writing the JSON string into the file.
    fh.write(job_data)
    print("JSON File has been created.")

JSON File has been created.


## 9. References

- Python Software Foundation (2018). *`sys` — System-specific parameters and functions*. Retrieved from https://docs.python.org/3.6/library/sys.html#module-sys

- Python Software Foundation (2018). *`os` — Miscellaneous operating system interfaces*. Retrieved from https://docs.python.org/3/library/os.html#module-os

- Python Software Foundation (2018). *`os` Documentation: os.listdir() function*. Retrieved from https://docs.python.org/3.6/library/os.html#os.listdir

- Python Software Foundation (2018). *`re` — Regular expression operations*. Retrieved from https://docs.python.org/3.6/library/re.html#module-re

- Python Software Foundation (2018). *`json` — JSON encoder and decoder*. Retrieved from https://docs.python.org/3.6/library/json.html#module-json

- Python Software Foundation (2018). *`re` Documentation: re.findall() function*. Retrieved from https://docs.python.org/3.6/library/re.html#re.findall

- Python Software Foundation (2018). *Built-in Functions: all() function*. Retrieved from https://docs.python.org/3.6/library/functions.html#all

- Python Software Foundation (2018). *`re` Documentation: re.split() function*. Retrieved from https://docs.python.org/3.6/library/re.html#re.split

- Python Software Foundation (2018). *`re` Documentation: re.sub() function*. Retrieved from https://docs.python.org/3.6/library/re.html#re.sub

- Python Software Foundation (2018). *`json` Documentation: json.dumps() function*. Retrieved from https://docs.python.org/3.6/library/json.html#json.dumps

- Pieters M. (2014, April 16). How to dynamically build a JSON object with Python? [Response to]. Retrieved from https://stackoverflow.com/a/23110401/6277438

- Jain R. (2013, August 31). How to frame two for loops in list comprehension python [Response to]. Retrieved from https://stackoverflow.com/a/18551476/6277438

- Pythex (2018). *Pythex*. Retrieved from https://pythex.org/