First, let's look at the data and see how the survey results might answer our first question: how do you break into the field of data science?

To get started, let's import the libraries we need to wrangle the data: pandas and numpy. We also import matplotlib and seaborn in case we decide to build some basic plots.

In [None]:
import numpy as np
import pandas as pd
from collections import defaultdict
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Multiple choice dataframe
mc_df = pd.read_csv('./datasets/multiple_choice_responses.csv')
mc_df.head()

As we can see, people from a variety of professional backgrounds answered the Kaggle survey.

In [None]:
mc_df.Q5.value_counts()

We can also see that some respondents are not employed, or not employed full time, which we can infer from their reported salary.

In [None]:
mc_df.Q11.value_counts()

Let's narrow it down to just the data scientists who are employed.

In [None]:
# Only data scientists and statisticians
ds_df = mc_df[mc_df['Q5'].isin(['Data Scientist', 'Statistician'])]
# Only those who are employed and make at least $1,000/yr
ds_df = ds_df[~ds_df['Q11'].isin(['$0 (USD)', '$1-$99', '$100-$999', ''])]

ds_df
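One caveat about the filter above: pandas reads blank CSV cells as `NaN` by default, and `isin` with an empty string does not match `NaN`, so rows with no reported salary are still in `ds_df`. The sketch below illustrates this on made-up data; `toy_df` and its values are hypothetical and only stand in for the survey's `Q11` column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey's salary column (Q11)
toy_df = pd.DataFrame(
    {'Q11': ['$0 (USD)', '$1000-$9,999', np.nan, '$100,000-$124,999']}
)

low_or_blank = ['$0 (USD)', '$1-$99', '$100-$999', '']

# NaN is not matched by '', so the row with a missing salary survives this filter
kept = toy_df[~toy_df['Q11'].isin(low_or_blank)]

# Dropping missing salaries explicitly removes it
kept_no_nan = kept[kept['Q11'].notnull()]
```

This is why it can help to filter on `notnull()` before counting salary values.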

Looking at the salary values for this cohort, a peculiar number of data scientists report making less than $10,000 a year, which is considered below the poverty line in the United States. This doesn't seem right.

In [None]:
ds_df[ds_df.Q11.notnull()].Q11.value_counts()

Let's check whether this is driven by the respondent's country.

In [None]:
ds_df[ds_df.Q11 == '$1000-$9,999'].Q3.value_counts()

Most respondents who reported less than $10,000 are from India. Which salary brackets did Indian respondents report most often, and is the distribution similar to that of the United States*?

<sub>*This comparison is based on the assumption that our readers are American.</sub>

In [None]:
ds_df[ds_df['Q3'] == 'India' ].Q11.value_counts(normalize=True)

In [None]:
ds_df[ds_df['Q3'] == 'United States of America' ].Q11.value_counts(normalize=True)
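The two distributions can be easier to compare side by side in a single table. The following is a minimal sketch using `pd.crosstab` on made-up data; `toy_df` and its rows are hypothetical, and the same call could be tried on the `Q11` and `Q3` columns of `ds_df`.

```python
import pandas as pd

# Hypothetical toy data: country (Q3) and salary bracket (Q11)
toy_df = pd.DataFrame({
    'Q3': ['India', 'India', 'India',
           'United States of America', 'United States of America'],
    'Q11': ['$1000-$9,999', '$1000-$9,999', '$10,000-$14,999',
            '$100,000-$124,999', '$1000-$9,999'],
})

# Rows are salary brackets, columns are countries, and each column
# holds the share of that country's respondents in each bracket
shares = pd.crosstab(toy_df['Q11'], toy_df['Q3'], normalize='columns')
```

With `normalize='columns'`, each country's column sums to 1, which matches what `value_counts(normalize=True)` reports per country.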

As we can see, the salary distributions of the United States and India are not comparable. Our audience for this analysis is mostly American, so an analysis based on salaries in the United States and similar countries might be more helpful.

We can narrow this down by evaluating only data from the top 30 countries classified as having "Very high human development" according to the Human Development Index (HDI) [(more info on HDI can be found here)](https://en.wikipedia.org/wiki/List_of_countries_by_Human_Development_Index#Very_high_human_development).

In [None]:
high_hdi_countries = ['Norway', 'Switzerland', 'Australia', 'Ireland', 'Germany', 'Iceland', 'Hong Kong', 
                      'Sweden', 'Singapore', 'Netherlands', 'Denmark', 'Canada', 'United States of America', 'United Kingdom',
                      'Finland', 'New Zealand', 'Belgium', 'Liechtenstein', 'Japan', 'Austria', 'Luxembourg',
                      'Israel', 'South Korea', 'France', 'Slovenia', 'Spain', 'Czech Republic', 'Italy', 'Malta',
                      'Estonia']

# Only capture top 30 HDI countries
ds_df = ds_df[ds_df['Q3'].isin(high_hdi_countries)]

In [None]:
ds_df.Q3.value_counts(normalize=True)
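Because the country filter matches on exact strings, any country the survey spells differently from `high_hdi_countries` would be dropped silently. The sketch below shows one way to check for that. The data and the shortened list are hypothetical (including the 'Republic of Korea' label); in the notebook, the same set difference could be run against `mc_df['Q3']`.

```python
import pandas as pd

# Hypothetical country labels standing in for the survey's Q3 column
toy_countries = pd.Series(['Norway', 'Republic of Korea', 'Japan', 'Norway'])

# Hypothetical shortened version of the allow-list
toy_allow_list = ['Norway', 'South Korea', 'Japan']

# Names in the allow-list that never appear in the data
unmatched = sorted(set(toy_allow_list) - set(toy_countries.dropna()))
```

Any name left in `unmatched` either has no respondents or is labeled differently in the survey, and is worth checking by hand.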