# Data Wrangling Template

https://www.kaggle.com/udacity/armenian-online-job-postings/

#### Armenian Online Job Postings
19,000 online job postings from 2004 to 2015 from Armenia's CareerCenter

#### Acknowledgements

The data collection and initial research were funded by the American University of Armenia’s research grant (2015).

Habet Madoyan, CEO at Datamotus, compiled this dataset and has granted us permission to republish. The republished dataset is identical to the original dataset. Datamotus also published a report detailing the text mining techniques used, plus analyses and visualizations of the data.

## Gather

In [1]:
import zipfile as zf
import pandas as pd

In [2]:
with zf.ZipFile('armenian-online-job-postings.zip','r') as myzip:
    myzip.extractall()

In [3]:
df = pd.read_csv('online-job-postings.csv')
df.head(2)

| | jobpost | date | Title | Company | AnnouncementCode | Term | Eligibility | Audience | StartDate | Duration | ... | Salary | ApplicationP | OpeningDate | Deadline | Notes | AboutC | Attach | Year | Month | IT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMERIA Investment Consulting Company\r\nJOB TI... | Jan 5, 2004 | Chief Financial Officer | AMERIA Investment Consulting Company | | | | | | | ... | | To apply for this position, please submit a\r\... | | 26 January 2004 | | | | 2004 | 1 | False |
| 1 | International Research & Exchanges Board (IREX... | Jan 7, 2004 | Full-time Community Connections Intern (paid i... | International Research & Exchanges Board (IREX) | | | | | | 3 months | ... | | Please submit a cover letter and resume to:\r\... | | 12 January 2004 | | The International Research & Exchanges Board (... | | 2004 | 1 | False |


## Assess

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost             19001 non-null object
date                19001 non-null object
Title               18973 non-null object
Company             18994 non-null object
AnnouncementCode    1208 non-null object
Term                7676 non-null object
Eligibility         4930 non-null object
Audience            640 non-null object
StartDate           9675 non-null object
Duration            10798 non-null object
Location            18969 non-null object
JobDescription      15109 non-null object
JobRequirment       16479 non-null object
RequiredQual        18517 non-null object
Salary              9622 non-null object
ApplicationP        18941 non-null object
OpeningDate         18295 non-null object
Deadline            18936 non-null object
Notes               2211 non-null object
AboutC              12470 non-null object
Attach              1559 non-null object
Year              

In [5]:
df['Year'].value_counts()

2012    2149
2015    2009
2013    2009
2014    1983
2008    1785
2011    1697
2007    1538
2010    1511
2009    1191
2005    1138
2006    1116
2004     875
Name: Year, dtype: int64

In [6]:
df.duplicated().sum()

39

* Missing values (NaNs)
* StartDate inconsistencies (many different spellings of "ASAP")
* Nondescriptive column headers (ApplicationP, AboutC, ...)
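
The first item in the list above can also be quantified directly rather than read off `df.info()`. The following is a minimal sketch on a toy frame: `toy` and its values are made up for illustration and stand in for `df`.

```python
import pandas as pd

# Toy frame standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'StartDate': ['ASAP', None, 'Immediately'],
    'Salary': [None, None, 'Competitive'],
})

# Number of missing values per column, largest first.
missing = toy.isna().sum().sort_values(ascending=False)
print(missing)
```

Running the same two calls on `df` should list the sparsest columns first.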

## Clean

#### Define
* StartDate inconsistencies - replace every variant spelling of "as soon as possible" in StartDate with ASAP
* Nondescriptive column headers (ApplicationP, AboutC, RequiredQual, JobRequirment) - rename each to a full-word header

#### Code

In [7]:
df_clean = df.copy()
df_clean = df_clean.rename(columns={'ApplicationP':'ApplicationProcedure',
                                   'AboutC':'AboutCompany',
                                   'RequiredQual':'RequiredQualifications',
                                   'JobRequirment':'JobRequirements'})

In [8]:
df_clean['StartDate'].value_counts()

ASAP                                                                                                                         4754
Immediately                                                                                                                   773
As soon as possible                                                                                                           543
Upon hiring                                                                                                                   261
Immediate                                                                                                                     259
Immediate employment                                                                                                          140
As soon as possible.                                                                                                           32
01 September 2012                                                                         

In [9]:
asap_list = ['Immediately', 'As soon as possible', 'Upon hiring',
             'Immediate', 'Immediate employment', 'As soon as possible.', 'Immediate job opportunity',
             '"Immediate employment, after passing the interview."',
             'ASAP preferred', 'Employment contract signature date',
             'Immediate employment opportunity', 'Immidiately', 'ASA',
             'Asap', '"The position is open immediately but has a flexible start date depending on the candidates earliest availability."',
             'Immediately upon agreement', '20 November 2014 or ASAP',
             'immediately', 'Immediatelly',
             '"Immediately upon selection or no later than November 15, 2009."',
             'Immediate job opening', 'Immediate hiring', 'Upon selection',
             'As soon as practical', 'Immadiate', 'As soon as posible',
             'Immediately with 2 months probation period',
             '12 November 2012 or ASAP', 'Immediate employment after passing the interview',
             'Immediately/ upon agreement', '01 September 2014 or ASAP',
             'Immediately or as per agreement', 'as soon as possible',
             'As soon as Possible', 'in the nearest future', 'immediate',
             '01 April 2014 or ASAP', 'Immidiatly', 'Urgent',
             'Immediate or earliest possible', 'Immediate hire',
             'Earliest  possible', 'ASAP with 3 months probation period.',
             'Immediate employment opportunity.', 'Immediate employment.',
             'Immidietly', 'Imminent', 'September 2014 or ASAP', 'Imediately']

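# Replace every variant in one vectorized call and assign the result back;
# calling replace(..., inplace=True) on the df_clean.StartDate attribute may not update df_clean in newer pandas.
df_clean['StartDate'] = df_clean['StartDate'].replace(asap_list, 'ASAP')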
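
`Series.replace` matches whole cell values by default, which is why every variant has to be listed explicitly. A minimal sketch on toy data, where `toy_dates` and `toy_synonyms` are hypothetical names and not part of the dataset:

```python
import pandas as pd

# Toy stand-in for df_clean['StartDate']; values are illustrative only.
toy_dates = pd.Series(['ASAP', 'Immediately', 'As soon as possible',
                       '01 September 2012', None])
toy_synonyms = ['Immediately', 'As soon as possible']

# A list passed as to_replace maps every listed value to the single new value.
# Values not in the list (real dates, missing entries) are left untouched.
normalized = toy_dates.replace(toy_synonyms, 'ASAP')
print(normalized.value_counts())
```

Because matching is exact, an entry such as `'Immediately '` with a trailing space would not be replaced unless it is added to the list.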

#### Test

In [10]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19001 entries, 0 to 19000
Data columns (total 24 columns):
jobpost                   19001 non-null object
date                      19001 non-null object
Title                     18973 non-null object
Company                   18994 non-null object
AnnouncementCode          1208 non-null object
Term                      7676 non-null object
Eligibility               4930 non-null object
Audience                  640 non-null object
StartDate                 9675 non-null object
Duration                  10798 non-null object
Location                  18969 non-null object
JobDescription            15109 non-null object
JobRequirements           16479 non-null object
RequiredQualifications    18517 non-null object
Salary                    9622 non-null object
ApplicationProcedure      18941 non-null object
OpeningDate               18295 non-null object
Deadline                  18936 non-null object
Notes                     2211 non

In [11]:
df_clean.StartDate.value_counts()

ASAP                                                                                                                         6856
01 September 2012                                                                                                              31
March 2006                                                                                                                     27
November 2006                                                                                                                  22
January 2010                                                                                                                   19
01 February 2005                                                                                                               17
February 2014                                                                                                                  17
September 2010                                                                            

In [12]:
for entry in asap_list:
    assert entry not in df_clean.StartDate.values 
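
The column rename from In [7] can be tested with assertions in the same way. The sketch below uses an empty toy frame (`toy` is hypothetical) so that it runs on its own; with the real data the loop would check `df_clean.columns` instead.

```python
import pandas as pd

# Same mapping as in the Code step above.
rename_map = {'ApplicationP': 'ApplicationProcedure',
              'AboutC': 'AboutCompany',
              'RequiredQual': 'RequiredQualifications',
              'JobRequirment': 'JobRequirements'}

# Empty toy frame carrying the original abbreviated headers plus one untouched column.
toy = pd.DataFrame(columns=['ApplicationP', 'AboutC', 'RequiredQual',
                            'JobRequirment', 'Salary'])
toy_clean = toy.rename(columns=rename_map)

# Every old header should be gone and every new header present.
for old, new in rename_map.items():
    assert old not in toy_clean.columns
    assert new in toy_clean.columns
```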