---
# Data Related Job Descriptions Across 10 Countries

This data was scraped from Indeed across 10 countries:

- Australia
- Canada
- France
- Hong Kong
- Japan
- Singapore
- South Africa
- Switzerland
- United Kingdom
- United States

Three roles were searched on each country's Indeed site:

- Data Scientist
- Data Analyst
- Data Engineer
---

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# word_list.py holds the predefined keyword lists for each category
from word_list import (
    prog_lang, analysis, machine_learning, database, cloud, edu,
    big_data, lang, other, healthcare, stats,
)

pd.set_option('display.max_rows', 10000)  # show long result tables in full
pd.options.mode.chained_assignment = None  # silence SettingWithCopyWarning

---
## Importing Word Count Data

After the job postings were scraped, each job description was passed through a function that counts occurrences of predefined words.
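
The counting function itself is not shown in this notebook. The following is a minimal sketch of how such a step might work, assuming `NumPostings` is the number of postings that mention a word at least once; the function name `count_postings_with_words` and the sample descriptions are hypothetical.

```python
import re
from collections import Counter


def count_postings_with_words(descriptions, words):
    """Count how many descriptions mention each word at least once (case-insensitive)."""
    counts = Counter()
    # One whole-word pattern per term; re.escape copes with terms such as "C++"
    patterns = {
        w: re.compile(r"(?<!\w)" + re.escape(w) + r"(?!\w)", re.IGNORECASE)
        for w in words
    }
    for text in descriptions:
        for word, pattern in patterns.items():
            if pattern.search(text):
                counts[word] += 1
    return counts


# Hypothetical job descriptions, for illustration only
descriptions = [
    "We need Python and SQL skills.",
    "Strong SQL required; python is a plus.",
    "Experience with Spark.",
]
num_postings = count_postings_with_words(descriptions, ["Python", "SQL", "R"])
```

The lookarounds keep a short term such as `R` from matching inside longer words like "required" or "Spark".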

`skills_role_country_total.csv` contains the word counts across all countries and search terms. 

The data frame below is filtered by words usually associated with the following categories:
- Programming Languages
- Analysis
- Machine Learning
- Database
- Education 
- Big Data 
- Languages
- Health Care
- Math and Statistics
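
The category filter relies on plain Python lists of keywords and `Series.isin`. As a rough sketch of the idea, the dictionary below stands in for the lists in `word_list.py`; its contents and the toy frame `df_demo` are assumptions made for illustration, not the real keyword lists.

```python
import pandas as pd

# Hypothetical stand-ins for the keyword lists defined in word_list.py
categories = {
    "prog_lang": ["Python", "R", "SQL"],
    "cloud": ["AWS", "Azure"],
}

# Toy word counts, for illustration only
df_demo = pd.DataFrame({
    "Word": ["Python", "AWS", "SQL", "Excel"],
    "NumPostings": [120, 45, 150, 80],
})

# One filtered frame per category, keyed by category name
filtered = {
    name: df_demo[df_demo["Word"].isin(words)].reset_index(drop=True)
    for name, words in categories.items()
}
```

Keeping the frames in a dictionary keyed by category name avoids having to remember which list position belongs to which category.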

Finally, a function `top_count_by_term_country` accepts a list of DataFrames, groups each one by `Search Term` and `Country`, and returns the `top_n` rows by `NumPostings` per group.
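
To make the top-n-per-group idea concrete, here is a small sketch on toy data. It uses a different technique from the notebook's function (sorting, then `groupby(...).head(n)`) and the helper name `top_n_per_group` is hypothetical; the column names match the CSV, the values are made up.

```python
import pandas as pd

# Toy data using the same column names as the CSV; values are invented
demo = pd.DataFrame({
    "Search Term": ["Data Analyst"] * 3 + ["Data Engineer"] * 3,
    "Country": ["Canada"] * 3 + ["Japan"] * 3,
    "Word": ["SQL", "Excel", "Python", "Spark", "SQL", "Python"],
    "NumPostings": [50, 40, 30, 20, 60, 35],
})


def top_n_per_group(frame, top_n):
    """Keep the top_n rows by NumPostings within each (Search Term, Country) group."""
    # Sort once, then take the first top_n rows of every group
    ordered = frame.sort_values("NumPostings", ascending=False)
    top = ordered.groupby(["Search Term", "Country"], sort=False).head(top_n)
    # Order the result by group, largest count first, and index it by the group keys
    top = top.sort_values(
        ["Search Term", "Country", "NumPostings"],
        ascending=[True, True, False],
    )
    return top.set_index(["Search Term", "Country"])


top_2 = top_n_per_group(demo, 2)
```

Because `head` returns rows rather than an applied result, this variant does not depend on how `groupby(...).apply` treats the grouping columns.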

In [None]:
df = pd.read_csv('./job-stats/word-count/skills_role_country_total.csv')

In [None]:
df.drop(columns='Percentage', inplace=True)

In [None]:
df_prog_lang = df[df['Word'].isin(prog_lang)].reset_index(drop=True)
df_analysis = df[df['Word'].isin(analysis)].reset_index(drop=True)
df_machine_learning = df[df['Word'].isin(machine_learning)].reset_index(drop=True)
df_database = df[df['Word'].isin(database)].reset_index(drop=True)
df_edu = df[df['Word'].isin(edu)].reset_index(drop=True)
df_big_data = df[df['Word'].isin(big_data)].reset_index(drop=True)
df_lang = df[df['Word'].isin(lang)].reset_index(drop=True)
df_healthcare = df[df['Word'].isin(healthcare)].reset_index(drop=True)
df_stats = df[df['Word'].isin(stats)].reset_index(drop=True)

filtered_skills = [df_prog_lang, df_analysis, df_machine_learning, df_database, df_edu, 
                  df_big_data, df_lang, df_healthcare, df_stats]

In [None]:
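def top_count_by_term_country(dataframe_list, top_n):
    """Return, for each DataFrame, the top_n rows by NumPostings per (Search Term, Country)."""
    top_list = []
    for frame in dataframe_list:
        # Keep the top_n rows of each group; the result is indexed by
        # (Search Term, Country, original row label), so drop the third level
        data = (frame.groupby(['Search Term', 'Country'])
                     .apply(lambda grp: grp.nlargest(top_n, 'NumPostings'))
                     .droplevel(level=2))
        # The group keys are already in the index; errors='ignore' covers
        # pandas versions where apply no longer returns them as columns
        data = data.drop(columns=['Search Term', 'Country'], errors='ignore')
        top_list.append(data)
    return top_list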

filtered_df = top_count_by_term_country(filtered_skills, 10)

df_prog = filtered_df[0]
df_analysis = filtered_df[1]
df_ml = filtered_df[2]
df_db = filtered_df[3]
df_ed = filtered_df[4]
df_bd = filtered_df[5]
df_la = filtered_df[6]
df_hc = filtered_df[7]
df_st = filtered_df[8]

---
## Top 10 counted words per role and country

In [None]:
# Top 10 words by NumPostings within each (Search Term, Country) group
top_10 = (df.groupby(['Search Term', 'Country'])
            .apply(lambda grp: grp.nlargest(10, 'NumPostings'))
            .droplevel(level=2)
            .drop(columns=['Search Term', 'Country'], errors='ignore'))

top_10

---
## Top 10 counted words in "Programming Languages" phrases

In [None]:
df_prog

---
## Top 10 counted words in "Analysis" phrases

In [None]:
df_analysis

---
## Top 10 counted words in "Machine Learning" phrases

In [None]:
df_ml

---
## Top 10 counted words in "Database" phrases

In [None]:
df_db

---
## Top 10 counted words in "Education" phrases

In [None]:
df_ed

---
## Top 10 counted words in "Big Data" phrases

In [None]:
df_bd

---
## Top 10 counted words in "Language" phrases

In [None]:
df_la

---
## Top 10 counted words in "Health Care" phrases

In [None]:
df_hc

---
## Top 10 counted words in "Stats" phrases

In [None]:
df_st

---
# Most Counted Terms

Most counted terms across all search terms and countries:

In [None]:
top_terms = df.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

In [None]:
top_terms.head(300)

---
# Most Counted Terms Per Group

Most counted terms in each word category, summed across all roles and all 10 countries:

---
## Most counted terms related to "Programming Languages"

In [None]:
df_prog_lang.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Data Analysis"

In [None]:
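# df_analysis was reassigned to the top-10 frame earlier, so use the full filtered frame here
filtered_skills[1].groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)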

---
## Most counted terms related to "Machine Learning"

In [None]:
df_machine_learning.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Databases"

In [None]:
df_database.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Education"

In [None]:
df_edu.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Big Data"

In [None]:
df_big_data.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Human Languages"

In [None]:
df_lang.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Health Care"

In [None]:
df_healthcare.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
## Most counted terms related to "Math / Statistics / Probability"

In [None]:
df_stats.groupby(['Word'])['NumPostings'].sum().sort_values(ascending=False)

---
# Word Count Exploration



In [None]:
top_terms['Analytical'] += top_terms['Analytics']

In [None]:
top_terms.drop('Analytics', inplace=True)
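
The two cells above fold the count for `Analytics` into `Analytical` by hand. If more variants need merging, a mapping-based approach may scale better. The sketch below assumes a Series indexed by word, like `top_terms`; the `synonyms` mapping and the counts are made up for illustration.

```python
import pandas as pd

# Toy counts indexed by word, shaped like top_terms; values are invented
counts = pd.Series({"SQL": 300, "Analytics": 120, "Python": 250, "Analytical": 200})

# Hypothetical mapping from a variant spelling to the label to keep
synonyms = {"Analytics": "Analytical"}

merged = (
    counts.rename(index=synonyms)      # relabel variants to the kept label
          .groupby(level=0).sum()      # add up rows that now share a label
          .sort_values(ascending=False)
)
```

Re-sorting at the end matters: after merging, a word can move up the ranking, so any later `head(n)` should run on the re-sorted Series.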

In [None]:
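# Re-sort first: merging 'Analytics' into 'Analytical' may have changed the ranking
top_10_words = list(top_terms.sort_values(ascending=False).head(10).index)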
top_10_words

In [None]:
df_top_10 = df[df['Word'].isin(top_10_words)]
df_top_10

In [None]:
plt.figure(figsize=(20, 10))
# Marker size and color both encode NumPostings for each (Word, Country) pair
sns.scatterplot(
    data=df_top_10.sort_values(by='NumPostings'),
    x="Word",
    y="Country",
    hue="NumPostings",
    size='NumPostings',
    sizes=(200, 2000),
)