# Where the data science jobs are (Part 2)

This post is a continuation of [Where the data science jobs are (part 1)](https://github.com/sedeh/github.io/blob/master/projects/where_the_data_science_related_jobs_are_part1.ipynb). In this installment, we're going to analyze the dataset we downloaded and cleaned in part 1. The dataset contains job information about non-U.S. workers. When a U.S. company wants to hire a foreign worker, the company is required to file a work visa (H1B) or permanent residency (green card) application with the U.S. Department of Labor. As part of the application, the company must disclose how much U.S. workers in similar jobs are being paid. This is called the `Prevailing Wage`. The dataset only contains H1B data.

We can take advantage of this information to gain insights into the general U.S. job market. Specifically, we're going to look at data science related jobs. If the premise behind the prevailing wage requirement holds, U.S. companies hire foreign workers because there are no qualified U.S. applicants to fill the roles. So a company that hires foreign workers likely also hires many U.S. workers. After all, it only hires non-U.S. workers because it has exhausted the pool of qualified U.S. applicants.

In part 1, we downloaded, cleaned and enriched the dataset. Now, let's probe the dataset. The dataset contains the following fields that are relevant to our current analysis.

- `Submitted_Date`: Timestamp reflecting when the H1B application was received by the government
- `Employer_Name`: Name of the U.S. company filing the H1B application
- `Work_State`: Full name of the state where the H1B job is located
- `Work_State_Code`: Two-letter abbreviation of the state where the H1B job is located
- `Job_Category`: Unofficial job subcategory assigned to the Job Title listed on the application
- `Offered_Salary_Adjusted`: Annual salary offered to the foreign worker beneficiary of the H1B application
- `Prevailing_Salary_Adjusted`: Annual salary (prevailing wage) for similar jobs
- `Census_2015`: 2015 population figure for the state where the H1B job is located

Annual salary has been adjusted to reflect regional inflation. 

## Load data

Here's a snapshot of the dataset.

In [1]:
# Load data
import pandas as pd
pd.options.mode.chained_assignment = None # For now, let's turn off pandas' chained-assignment warning
dsJobs = pd.read_csv("dataScienceJobs.csv")
print("\n")
print(dsJobs.head())



        Submitted_Date                                      Employer_Name  \
0  2006-10-02 08:35:53                                   IMCS GROUP, INC.   
1  2006-10-02 08:53:34                       LG Electronics Alabama, Inc.   
2  2006-10-02 09:04:25  Seacoast National Bank (First National Bank &t...   
3  2006-10-02 09:16:18                             Seacoast National Bank   
4  2006-10-02 10:11:06                       Northern Illinois University   

   Work_City Work_State    Job_Category Work_State_Code  Price_Deflator  \
0  Santa Ana      Texas  market analyst              TX          102.90   
1  Ft. Worth    Alabama  market analyst              AL           93.65   
2     Stuart    Florida  market analyst              FL          105.35   
3     Stuart    Florida  market analyst              FL          105.35   
4     DeKalb   Illinois    data analyst              IL          107.50   

   Offered_Salary_Adjusted  Prevailing_Salary_Adjusted  Census_2015  
0             
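A quick sanity check after loading can confirm that the timestamp column parses as dates and show how the job categories are distributed. The sketch below is only an illustration: it rebuilds the five rows printed above as an in-memory CSV (leaving out `Employer_Name` and the salary columns, which the printed output truncates) rather than reading `dataScienceJobs.csv`, and the variable names are assumptions.

```python
import io

import pandas as pd

# Five rows copied from the snapshot above. Employer_Name and the salary
# columns are omitted because the printed output truncates them.
sample_csv = io.StringIO(
    "Submitted_Date,Work_City,Work_State,Job_Category,Work_State_Code,Price_Deflator\n"
    "2006-10-02 08:35:53,Santa Ana,Texas,market analyst,TX,102.90\n"
    "2006-10-02 08:53:34,Ft. Worth,Alabama,market analyst,AL,93.65\n"
    "2006-10-02 09:04:25,Stuart,Florida,market analyst,FL,105.35\n"
    "2006-10-02 09:16:18,Stuart,Florida,market analyst,FL,105.35\n"
    "2006-10-02 10:11:06,DeKalb,Illinois,data analyst,IL,107.50\n"
)

# parse_dates turns the timestamp strings into datetime64 values on load.
sample = pd.read_csv(sample_csv, parse_dates=["Submitted_Date"])

# Confirm the timestamp column was parsed as datetimes, not strings.
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["Submitted_Date"])

# Count applications per job category.
category_counts = sample["Job_Category"].value_counts()

print(is_datetime)
print(category_counts)
```

The same two checks should work on the full `dsJobs` frame, assuming `Submitted_Date` is passed to `parse_dates` when the file is read.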

## Query data

Let's find out the Top 10 states for data science related H1B jobs.

In [2]:
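# Keep only the columns needed for the per-capita count (copy to avoid modifying a view)
data = dsJobs[["Work_State_Code", "Census_2015"]].copy()
# Each row is one application, so it adds 10000 / population to its state's total
data["Job_Per_10000"] = 10000 * (1 / data["Census_2015"])
# Note: after sum(), Census_2015 is population x number of rows, not a population
data = data.groupby(['Work_State_Code']).sum()
state_data = data.reset_index()
state_data.sort_values(by="Job_Per_10000", ascending=False, inplace=True)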
print("\n")
print(state_data.head(10).reset_index(drop=True))



  Work_State_Code   Census_2015  Job_Per_10000
0              DC  1.181105e+09      26.136965
1              NJ  1.955982e+11      24.374825
2              DE  1.555115e+09      17.379648
3              NY  5.162544e+11      13.174013
4              VA  8.342755e+10      11.871655
5              MA  5.309841e+10      11.502082
6              CT  1.447486e+10      11.225642
7              CA  1.499599e+12       9.786481
8              IL  1.388751e+11       8.397359
9              MD  2.644018e+10       7.328848


Except for Illinois (IL), every entry in this list is on either the East Coast or the West Coast. In fact, East Coast states dominate.
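The per-capita figure in the table is each state's application count divided by its population and scaled to 10,000 inhabitants. The sketch below shows one way to compute it so that the population column stays readable; it is not the code used for the results above. The state codes `AA` and `BB` and their populations are made-up placeholders.

```python
import pandas as pd

# Hypothetical toy data: one row per H1B application, with the state's
# population repeated on every row (as in dsJobs).
apps = pd.DataFrame({
    "Work_State_Code": ["AA", "AA", "AA", "BB", "BB"],
    "Census_2015": [20000, 20000, 20000, 50000, 50000],
})

# Count rows per state and keep a single copy of the population.
per_state = apps.groupby("Work_State_Code").agg(
    Jobs=("Census_2015", "size"),
    Census_2015=("Census_2015", "first"),
)

# Applications per 10,000 inhabitants.
per_state["Job_Per_10000"] = 10000 * per_state["Jobs"] / per_state["Census_2015"]
per_state = per_state.sort_values("Job_Per_10000", ascending=False).reset_index()

# Cross-check against the row-wise approach: summing 10000 / population
# over a state's rows gives the same number.
row_wise = (10000 / apps["Census_2015"]).groupby(apps["Work_State_Code"]).sum()

print(per_state)
```

Both approaches give the same `Job_Per_10000`. The difference is that here `Census_2015` remains the actual population rather than the population multiplied by the number of applications.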

Let's see the states that hold up the bottom end.

In [3]:
print("\n")
print(state_data.tail(10).reset_index(drop=True))



  Work_State_Code  Census_2015  Job_Per_10000
0              MS  511688943.0       0.571460
1              PW      20918.0       0.478057
2              WV  149374368.0       0.439232
3              PR  455117842.0       0.377067
4              WY   12894354.0       0.375358
5              AS     111038.0       0.360237
6              MT   35120266.0       0.329155
7              FM     207098.0       0.193145
8              MH      52634.0       0.189991
9              MP      53883.0       0.185587


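It seems that U.S. territories and freely associated states dominate here, taking six of the ten spots: Puerto Rico (PR), American Samoa (AS) and the Northern Mariana Islands (MP) are territories, while Palau (PW), Micronesia (FM) and the Marshall Islands (MH) are freely associated states.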
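To compare states with each other, it can help to set the non-state codes aside first. The sketch below does that for the ten rows printed above. The `NON_STATE_CODES` set and the `Is_State` column are assumptions introduced for illustration, and the set lists only the codes seen here plus Guam (GU) and the U.S. Virgin Islands (VI).

```python
import pandas as pd

# Bottom-10 rows copied from the output above.
bottom = pd.DataFrame({
    "Work_State_Code": ["MS", "PW", "WV", "PR", "WY", "AS", "MT", "FM", "MH", "MP"],
    "Job_Per_10000": [0.571460, 0.478057, 0.439232, 0.377067, 0.375358,
                      0.360237, 0.329155, 0.193145, 0.189991, 0.185587],
})

# Codes that are not U.S. states: territories (PR, AS, MP, GU, VI) and
# freely associated states (PW, FM, MH). DC is not in this set, so it
# would still be kept alongside the states.
NON_STATE_CODES = {"PR", "AS", "MP", "GU", "VI", "PW", "FM", "MH"}

# Flag each row, then keep only the rows that belong to states.
bottom["Is_State"] = ~bottom["Work_State_Code"].isin(NON_STATE_CODES)
states_only = bottom[bottom["Is_State"]].reset_index(drop=True)

print(states_only)
```

Applied to the rows above, this leaves Mississippi, West Virginia, Wyoming and Montana as the lowest-ranked states.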

## Map data

Let's make an interactive D3.js-based choropleth map in `plotly`.

In [4]:
# Learn about API authentication here: https://plot.ly/python/getting-started
# Find your api_key here: https://plot.ly/settings/api
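import pandas as pd
# Note: plotly 4+ moved plotly.plotly and the credential helpers to the separate chart_studio package
import plotly.plotly as py
import plotly.tools as tls
# Use your own API key here and keep it out of published notebooks
tls.set_credentials_file(username='samueledeh', api_key='YOUR_API_KEY')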

scl = [[0.0, 'rgb(242,240,247)'], [0.2, 'rgb(218,218,235)'], [0.4, 'rgb(188,189,220)'],
       [0.6, 'rgb(158,154,200)'], [0.8, 'rgb(117,107,177)'], [1.0, 'rgb(84,39,143)']]

data = [dict(
    type='choropleth',
    colorscale = scl,
    autocolorscale = False,
    locations = state_data.Work_State_Code,
    z = state_data['Job_Per_10000'].astype(float),
    locationmode = 'USA-states',
    text = state_data.Work_State_Code,
    hoverinfo = 'location+z',
    marker = dict(
        line = dict (
            color = 'rgb(255,255,255)',
            width = 2
        )
    ),
    colorbar = dict(
        title = "# of H1B Jobs Per 10,000 inhabitants"
    )
)]

layout = dict(
    title = 'Data science related Jobs<br>(Hover for breakdown)',
    geo = dict(
        scope='usa',
        projection=dict( type='albers usa' ),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)'
    )
)
    
fig = dict(data=data, layout=layout)

py.iplot(fig, validate=False, filename='ds-jobs-map')

Based on the plot above (also available [here](https://plot.ly/2/~samueledeh/)), New Jersey is the hottest state for data science related H1B jobs, followed by Delaware. (The District of Columbia tops the table in the previous section, but it is not a state.) New York is also very strong, finishing well ahead of California. In general, data science related H1B jobs are concentrated along the U.S. coasts. With the exception of Illinois, Texas and Minnesota, the middle part of America looks relatively barren as far as data science related H1B jobs are concerned.
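The map cell uploads the figure to Plotly's hosted service, which needs an account and an API key. The same figure can probably be written to a local HTML file instead. The sketch below assumes a recent `plotly` install (version 4 or later) for the rendering step. The helper names `build_figure` and `write_local_html`, the output file name, and the built-in `Purples` colorscale (standing in for the custom `scl` list) are choices made for this example.

```python
# Top-10 rows copied from the table in the "Query data" section.
codes = ["DC", "NJ", "DE", "NY", "VA", "MA", "CT", "CA", "IL", "MD"]
jobs_per_10000 = [26.136965, 24.374825, 17.379648, 13.174013, 11.871655,
                  11.502082, 11.225642, 9.786481, 8.397359, 7.328848]


def build_figure(locations, values):
    """Return a plain-dict choropleth spec similar to the map cell above."""
    trace = dict(
        type="choropleth",
        locations=list(locations),
        z=[float(v) for v in values],
        locationmode="USA-states",
        colorscale="Purples",  # built-in scale, used here instead of scl
        hoverinfo="location+z",
        colorbar=dict(title=dict(text="# of H1B Jobs Per 10,000 inhabitants")),
    )
    layout = dict(
        title=dict(text="Data science related Jobs<br>(Hover for breakdown)"),
        geo=dict(scope="usa", projection=dict(type="albers usa")),
    )
    return dict(data=[trace], layout=layout)


def write_local_html(fig_dict, path="ds-jobs-map.html"):
    """Render the spec to a standalone HTML file (requires plotly)."""
    # Imported lazily so the spec itself can be built without plotly installed.
    import plotly.graph_objects as go

    go.Figure(fig_dict).write_html(path)


fig = build_figure(codes, jobs_per_10000)
# write_local_html(fig)  # uncomment to produce the interactive map locally
```

Passing `state_data.Work_State_Code` and `state_data['Job_Per_10000']` to `build_figure` should give the full map rather than just the top ten.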

> ### The U.S. Coasts are the hotbeds for data science related jobs

Are you in the job market for a data science related job? If so, well, you may have just gotten some relocation ideas lined up for you! Before you start packing though, let's finish this story in Tableau and find out the [Top paying states and companies for data science related jobs](https://public.tableau.com/views/top_paying_states_for_data_science/Story1?:embed=y&:display_count=yes&:showTabs=y). 