[Install](https://conda.io/docs/user-guide/install/index.html) Jupyter Notebook.

# $$Python\ Workshop\ 2019$$

Tədbirin gündəliyi:
* Kaggle nədir və Kaggle vasitəsilə neçə data elmini öyrənmək olar;
* Məlumat analizində Python-nun yeri;
* Real nümunə üzərində məlumat analizi;

Real nümunə haqqında:
* **`Data collection / web crawling`** (məlumat toplanması) - ən böyük onlayn iş elanı veb səhifəsindən məlumatların avtomatik götürülməsi və faylda saxlanılması;
* **`Data cleaning`** (məlumat təmizlənməsi) - dəyişənlərin tipinin dəyişdirilməsi, catışmayan məlumatların bərpa olunması, artıq sətrlərin təmizlənməsi;
* **`Feature generation`** (yeni dəyişənlərin yaradılması) - əldə olan məlumatlardan yenisinin yaradılması, misal üçün: elanın mətnindən hansı dildə yazılması;
* **`Data visualizatoin`** (məlumatların təsviri) - məlumatların vizual təqdim olunması;

In [1]:
'#'.join(ch for ch in 'python')

'p#y#t#h#o#n'

In [2]:
list('python')

['p', 'y', 't', 'h', 'o', 'n']

In [3]:
for ch in 'python':
    print(ch)

p
y
t
h
o
n


## $$The\ Zen\ of\ Python$$

In [4]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


Understanding `The Zen of Python` via [examples](https://artifex.org/~hblanks/talks/2011/pep20_by_example.html).

## $$1.\ Import\ packages$$

In [5]:
%matplotlib inline

import requests
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import langdetect

from wordcloud import WordCloud

## $$2.\ Import\ data\ from\ website$$

In [6]:
# if `jobs.csv` exist don't extract data from website

file_name = 'jobs.csv'
url = 'http://jobsearch.az/'

if not os.path.isfile(file_name):
    print("File doesn't exit. Extract from website.")
    html = requests.get(url).content
    df_list = pd.read_html(html)
    jobs = df_list[9]
    # export to csv
    jobs.to_csv('jobs.csv', index=False)
else:
    print("File exist. Read from file.")
    jobs = pd.read_csv(file_name)

File doesn't exit. Extract from website.


Remove first column and put first row as column.

In [7]:
# jobs.drop(columns=[jobs.columns[0]], axis=1, inplace=True) # remove first column
jobs.columns = jobs.iloc[0, :] # rename column to first row
jobs = jobs.iloc[1:, :] # remove first row

In [8]:
jobs.sample(n=30)

Unnamed: 0,POSITION,EMPLOYER,POSTED,DEADLINE
373,Translator,KPMG Azerbaijan,2018-12-27,2019-01-26
243,Sales Manager in Tourism (new),Company,2019-01-07,2019-02-06
127,Satış meneceri (new),Fil Agency,2019-01-09,2019-02-08
335,B2B Sales Manager,Gulfstream Distribution,2018-12-28,2019-01-27
490,Restaurant Supervisor,"Fairmont Baku, Hotels & Residences",2018-12-21,2019-01-20
88,Business consultant (new),RSM Azerbaijan,2019-01-10,2019-02-09
172,Korporativ Hüquqşünas (new),Unibank,2019-01-08,2019-02-07
25,"Korporativ sığorta şöbəsi, assistent (new)",PASHA Life,2019-01-11,2019-02-10
79,Sales Executive (new),Hotel,2019-01-10,2019-02-09
175,IT Specialist (new),PDL World,2019-01-08,2019-02-07


## $$3.\ Data\ cleaning$$

In [9]:
print('Dimension of the dataset :', jobs.shape)
jobs.head()

Dimension of the dataset : (691, 4)


Unnamed: 0,POSITION,EMPLOYER,POSTED,DEADLINE
1,Специалист по продаже финансовых услуг (new),Invest AZ QSC,2019-01-12,2019-02-11
2,Satış mühəndisi (new),Bauger MMC,2019-01-11,2019-02-10
3,Texniki nəzarət üzrə mütəxəssis (new),Metak,2019-01-11,2019-02-10
4,Tibbi nümayəndə (new),Profderma,2019-01-11,2019-02-10
5,Alətçi-çilingər (new),Metak,2019-01-11,2019-02-10


Number of missing values per column.

In [10]:
jobs.isnull().sum()

0
POSITION    0
EMPLOYER    0
POSTED      0
DEADLINE    0
dtype: int64

Get types per column.

In [11]:
jobs.get_dtype_counts()

object    4
dtype: int64

## $$4.\ Feature\ generation$$

Dummy variable to seperate new and old features.

In [12]:
def is_new_job(pos):
    '''Return dummy variable about '''
    return 'New' if pos.split()[-1] == '(new)' else 'Old'

In [13]:
jobs = jobs.assign(is_new = jobs.POSITION.apply(lambda x: is_new_job(x)))

Remove `(new)` tag from job position.

In [14]:
def clean_position(pos):
    '''Cleaning Positions'''
    pos = pos.lower()
    # Without loop :)
    pos_clean = ' '.join(pos.split()[0:-1])
    #pos_clean = ' '.join(w for w in pos.split() if w != '(new)')
    pos_clean = pos_clean.strip()
    
    return pos_clean

In [15]:
jobs.head()

Unnamed: 0,POSITION,EMPLOYER,POSTED,DEADLINE,is_new
1,Специалист по продаже финансовых услуг (new),Invest AZ QSC,2019-01-12,2019-02-11,New
2,Satış mühəndisi (new),Bauger MMC,2019-01-11,2019-02-10,New
3,Texniki nəzarət üzrə mütəxəssis (new),Metak,2019-01-11,2019-02-10,New
4,Tibbi nümayəndə (new),Profderma,2019-01-11,2019-02-10,New
5,Alətçi-çilingər (new),Metak,2019-01-11,2019-02-10,New


In [16]:
jobs = jobs.assign(POSITION = jobs.POSITION.apply(lambda x: clean_position(x)))

In [17]:
jobs.head()

Unnamed: 0,POSITION,EMPLOYER,POSTED,DEADLINE,is_new
1,специалист по продаже финансовых услуг,Invest AZ QSC,2019-01-12,2019-02-11,New
2,satış mühəndisi,Bauger MMC,2019-01-11,2019-02-10,New
3,texniki nəzarət üzrə mütəxəssis,Metak,2019-01-11,2019-02-10,New
4,tibbi nümayəndə,Profderma,2019-01-11,2019-02-10,New
5,alətçi-çilingər,Metak,2019-01-11,2019-02-10,New


Convert string to date type. Add datetime variables:

* Weekday
* Month

In [18]:
jobs.POSTED = pd.to_datetime(jobs.POSTED) # convert to date type
jobs.DEADLINE = pd.to_datetime(jobs.DEADLINE) # convert to date type

month_name = {
    1 : 'Yanvar',
    2 : 'Fevral',
    3 : 'Mart',
    4 : 'Aprel',
    5 : 'May',
    6 : 'Iyun',
    7 : 'Iyul',
    8 : 'Avqust',
    9 : 'Sentyabr',
    10 : 'Oktyabr',
    11 : 'Noyabr',
    12 : 'Dekabr'
}

jobs = jobs.assign(posted_month = jobs.POSTED.dt.month.map(month_name))
jobs = jobs.assign(posted_weekday = jobs.POSTED.dt.weekday_name)

Find out `position` language, in which language the position was published.

In [19]:
def convert_language(lang):
    '''Convert language name and group them'''
    if lang == 'tr':
        return 'az'
    elif lang in ['en', 'ru']:
        return lang
    else:
        return 'other'

In [21]:
jobs = jobs.assign(position_lang = jobs.POSITION.apply(lambda x: langdetect.detect(x)))
jobs = jobs.assign(position_lang = jobs['position_lang'].map(convert_language))

LangDetectException: No features in text.

In [None]:
jobs.head()

In [None]:
jobs.position_lang.value_counts(normalize=True) * 100

In [None]:
jobs[jobs.position_lang == 'other'].head()

Get type of `employer`

In [None]:
# get last word from EMPLOYER
emp_type = jobs.EMPLOYER.apply(lambda x: x.split()[-1])

In [None]:
emp_type.value_counts(normalize=False).head(10) # top 10 employer type

We have the most popular employer type.

In [None]:
def get_employer_type(employer):
    '''Get employer type'''
    emp_types = ['LLC', 'MMC', 'ASC', 'Group', 'Holding', 'Bank', 'LTD']
    last_word = employer.split()[-1]
    return last_word if last_word in emp_types else 'Other'

In [None]:
jobs = jobs.assign(employer_type = jobs.EMPLOYER.apply(lambda x: get_employer_type(x)))

In [None]:
jobs.employer_type.value_counts(normalize=True) * 100

## $$5.\ Visualization$$

In [None]:
jobs.head()

Timeline of job publishment.

In [None]:
fig, ax = plt.subplots(figsize=(15,7))
jobs.groupby(['POSTED'])['POSTED'].count().plot(ax=ax)

In [None]:
jobs.POSTED.value_counts().head() # top 6 date with number of job ads

The most frequent `Employer` with jobs ads.

In [None]:
jobs.EMPLOYER.value_counts().head(10)

The most frequent `Position`.

In [None]:
jobs.POSITION.value_counts().head(10)

When the most popular day to publish job description.

In [None]:
sns.countplot(y=jobs.posted_weekday, 
              order = jobs.posted_weekday.value_counts().index)

Frequency of the job insertion.

In [None]:
sns.countplot(jobs.is_new)

Frequency of the `employer type`

In [None]:
sns.countplot(y=jobs.employer_type,
              order=jobs.employer_type.value_counts().index)

Frequency of the `position language`

In [None]:
sns.countplot(y=jobs.position_lang,
              order=jobs.position_lang.value_counts().index)

In [None]:
jobs.head()

## Word Cloud of Position

In [None]:
def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color='white',
        max_words=75,
        max_font_size=40, 
        scale=7,
        random_state=1 # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

### POSITION

`AZ`

### `long version`

In [None]:
filter_by_lang = jobs.POSITION[jobs.position_lang == 'az']
text = []

# seperate each word from `position`
for sen in filter_by_lang.str.split(): # for each sentence
    for w in sen: # for each word
        text.append(w)

# combine into a string
text = ' '.join(text)

# word cloud
show_wordcloud(text)

### `short version`

In [None]:
text = ' '.join(w for sen in jobs.POSITION[jobs.position_lang == 'az'].str.split() for w in sen)
show_wordcloud(text)

`EN`

In [None]:
text = ' '.join(w for sen in jobs.POSITION[jobs.position_lang == 'en'].str.split() for w in sen)
show_wordcloud(text)

`RU`

In [None]:
text = ' '.join(w for sen in jobs.POSITION[jobs.position_lang == 'ru'].str.split() for w in sen)
show_wordcloud(text)

`OTHER`

In [None]:
text = ' '.join(w for sen in jobs.POSITION[jobs.position_lang == 'other'].str.split() for w in sen)
show_wordcloud(text)

## EMPLOYER

In [None]:
text = ' '.join(w for sen in jobs.EMPLOYER.str.split() for w in sen)
show_wordcloud(text)