# Once upon a time, a student wanted to work as a data scientist

The student was in the last year of his engineering studies and wanted to know more about his future job. The 2019 Kaggle ML & DS Survey was the perfect opportunity to learn about the state of Machine Learning and data science.

He started by reading all the questions of the survey in order to select the most interesting ones.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# plt.style.context() only has an effect inside a `with` block,
# so the plot style is set through the seaborn helpers below
sns.set_style("darkgrid")
sns.set_palette('Set2')

# Skip the second line of the file, which holds the question texts, so that only answers are loaded
df = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv', skiprows=[1])

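# Get the question texts as a Series indexed by position (0, 1, 2, ...),
# so that integer lookups such as questions[5] keep working on recent pandas versions
questions = pd.read_csv('/kaggle/input/kaggle-survey-2019/questions_only.csv')
questions = questions.loc[0].reset_index(drop=True)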

for i in range(1, 34):
    print('{}- {}'.format(i,questions[i]), end='\n\n')

After looking through the survey questions, he decided to see what the different roles in the Machine Learning industry were.

In [None]:
# Return labels and counts for unique choice questions
def getDataFromUniqueAnswers(df, question):
    return df['Q{}'.format(question)].value_counts().index, df['Q{}'.format(question)].value_counts().values

# Return labels and counts for multiple choices questions
def getDataFromMultipleAnswers(df, question, numOfChoices):
    answers = [[df['Q{}_Part_{}'.format(question, i)].dropna().unique()[0],
            df['Q{}_Part_{}'.format(question, i)].count()]
            for i in range(1, numOfChoices + 1)]
    answers.sort(key = lambda x: x[1], reverse = True)
    return [answer[0] for answer in answers], [answer[1] for answer in answers]

# Return array of strings with the label and the percentage
def getNamePlusPourcentage(labels, counts):
    # Use the counts passed in, not the global y, to compute the total
    total = sum(counts)
    return ['{} ({:.2f}%)'.format(label, count / total * 100) for label, count in zip(labels, counts)]

x, y = getDataFromUniqueAnswers(df, 5)
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(y, labels=x, autopct='%1.f%%')
ax.set_title(questions[5], fontdict={'fontweight': 'bold'})
plt.show()

The most common role was Data Scientist, and many students had also answered the survey. Since the student wanted to work as a data scientist, he decided to find out more about the main activities of today's data scientists.
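
The multiple-choice helper defined above relies on the survey storing each option of a question in its own column (`Q9_Part_1`, `Q9_Part_2`, and so on), with the option text when it was ticked and an empty value otherwise. The sketch below shows that idea on a small made-up table; the function name `count_multiple_answers` and the data are hypothetical and only serve as an illustration. Unlike the helper above, it skips options that nobody selected instead of failing on them.

```python
import pandas as pd

# Toy data, made up for illustration (not the survey results):
# one column per option, holding the option text when ticked and None otherwise
toy = pd.DataFrame({
    'Q9_Part_1': ['Analyze data', None, 'Analyze data', 'Analyze data'],
    'Q9_Part_2': [None, 'Build prototypes', 'Build prototypes', None],
    'Q9_Part_3': ['Do research', None, None, None],
})

def count_multiple_answers(df, question, num_choices):
    """Return option labels and counts, most frequent first (hypothetical helper)."""
    rows = []
    for i in range(1, num_choices + 1):
        column = df['Q{}_Part_{}'.format(question, i)].dropna()
        if column.empty:
            continue  # nobody picked this option, so there is no label to read
        rows.append((column.iloc[0], len(column)))
    rows.sort(key=lambda row: row[1], reverse=True)
    return [label for label, _ in rows], [count for _, count in rows]

labels, counts = count_multiple_answers(toy, 9, 3)
print(labels, counts)
```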

In [None]:
dfDataScientists = df[df['Q5'] == 'Data Scientist']

x, y = getDataFromMultipleAnswers(dfDataScientists, 9, 8)
fig, ax = plt.subplots(figsize=(7, 7))
ax.set_title(questions[9] + '\n', fontdict={'fontweight': 'bold'}, loc='right')
ax = sns.barplot(x=y, y=x, orient='h', ax=ax)

In the student's opinion, a data scientist must analyze large amounts of data in order to help the company make the best strategic decisions. He was happy to see that the top answer was "Analyze and understand data to influence product or business decisions", which confirmed the idea he had of the job. But the following answers were also important, so a data scientist has several important activities.

Then, he wanted to confirm that Python was the programming language most used by data scientists.

In [None]:
x, y = getDataFromMultipleAnswers(dfDataScientists, 18, 12)
fig, ax = plt.subplots(figsize=(15, 7))
ax.set_title(questions[18] + '\n', fontdict={'fontweight': 'bold'})
ax = sns.barplot(x=x, y=y, orient='v', ax=ax)

He did not know that data scientists used SQL so much, so he was glad to have a good grounding in SQL thanks to his web development activities.

He also wanted to see if the popularity of Python for data scientists was confirmed by the programming language recommendations.

In [None]:
x, y = getDataFromUniqueAnswers(dfDataScientists, 19)
fig, ax = plt.subplots(figsize=(7, 7))
ax.pie(y)
ax.set_title(questions[19] + '\n', fontdict={'fontweight': 'bold'})
plt.legend(getNamePlusPourcentage(x, y), loc="upper left")
plt.show()

He was not surprised to see that more than 75% of data scientists recommended learning Python first.
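
The percentages shown in the legend are simply each answer's count divided by the total number of answers. As a minimal sketch, assuming a column of single-choice answers, pandas can compute these shares directly with `value_counts(normalize=True)`. The data below is made up for illustration and does not reproduce the survey figures.

```python
import pandas as pd

# Made-up answers, for illustration only (not the survey results)
answers = pd.Series(['Python', 'Python', 'Python', 'R', 'SQL', 'Python', 'R', 'Python'])

# normalize=True returns fractions of the total instead of raw counts
shares = answers.value_counts(normalize=True) * 100

# Same "label (xx.xx%)" format as the legend in the cell above
legend_labels = ['{} ({:.2f}%)'.format(label, share) for label, share in shares.items()]
print(legend_labels)
```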

Then, he wanted to see which IDEs were the most popular. He usually used Jupyter Notebooks and Spyder, but he had seen a lot of people criticizing Jupyter IDEs, and he did not really like Spyder.

In [None]:
x, y = getDataFromMultipleAnswers(dfDataScientists, 16, 12)
fig, ax = plt.subplots(figsize=(12, 7))
ax.set_title(questions[16] + '\n', fontdict={'fontweight': 'bold'}, loc='right')
ax = sns.barplot(x=y, y=x, orient='h', ax=ax)

He saw that, despite the criticism, Jupyter IDEs were really popular. PyCharm and Visual Studio Code were more popular than Spyder. He was already using Visual Studio Code for web development, so he decided to try it for Python development too.

He also wanted to see which primary tool data scientists used to analyze data.

In [None]:
x, y = getDataFromUniqueAnswers(dfDataScientists, 14)
fig, ax = plt.subplots(figsize=(10, 7))
ax.set_title(questions[14] + '\n', fontdict={'fontweight': 'bold'}, loc='right')
ax = sns.barplot(x=y, y=x, orient='h', ax=ax)

He was surprised to see that the majority of data scientists were using local development environments to analyze data directly.

Then, he decided to see which Machine Learning algorithms were the most used. This question was interesting because he wanted to study the most used algorithms in more depth.

In [None]:
x, y = getDataFromMultipleAnswers(dfDataScientists, 24, 12)
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(y, labels=x, autopct='%1.f%%')
ax.set_title(questions[24], fontdict={'fontweight': 'bold'})
plt.show()

Linear or Logistic Regression, Decision Trees or Random Forests, and Gradient Boosting Machines each got more than 15% of the answers. Neural Networks also got many answers once the different kinds of Neural Networks were grouped together, so the student knew that these 4 categories of algorithms were the most used. These answers also confirmed that there is no single Machine Learning model to master in order to be a good data scientist: a data scientist has to know several models and use the one appropriate to the problem at hand.
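
Grouping the different kinds of Neural Networks can be sketched as follows. The counts are made up for illustration, and the rule used to spot a neural network answer (the label contains "Neural Networks") is an assumption, not something defined by the survey.

```python
import pandas as pd

# Made-up counts, for illustration only (not the survey results)
counts = pd.Series({
    'Linear or Logistic Regression': 40,
    'Decision Trees or Random Forests': 35,
    'Gradient Boosting Machines': 25,
    'Convolutional Neural Networks': 20,
    'Dense Neural Networks': 15,
    'Recurrent Neural Networks': 15,
})

def to_family(label):
    # Assumption: any label containing "Neural Networks" belongs to the same family
    return 'Neural Networks (all kinds)' if 'Neural Networks' in label else label

# Relabel the index, then add up the counts that now share a label
grouped = counts.rename(index=to_family).groupby(level=0).sum()
grouped = grouped.sort_values(ascending=False)

# Share of each family in the total number of answers
shares = grouped / grouped.sum() * 100
print(shares.round(1))
```

With the three neural network rows merged, the family moves to the top of this toy ranking even though none of its members leads on its own.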

Now the student knew more about the daily life of a data scientist. Next, he decided to see how they keep up with data science news. The student himself mostly followed data science news through Towards Data Science and Medium.

In [None]:
x, y = getDataFromMultipleAnswers(dfDataScientists, 12, 12)
fig, ax = plt.subplots(figsize=(10, 7))
ax.set_title(questions[12] + '\n', fontdict={'fontweight': 'bold'}, loc='right')
ax = sns.barplot(x=y, y=x, orient='h', ax=ax)

He saw that many data scientists also followed the news through blogs. But he discovered that Kaggle was an important news source too, so he decided to visit Kaggle's website more often.

Then, he wanted to see which platforms were the most popular for taking data science courses. At the time, he was only following Machine Learning courses at his engineering school.

In [None]:
x, y = getDataFromMultipleAnswers(dfDataScientists, 13, 12)
fig, ax = plt.subplots(figsize=(8, 8))
ax.pie(y, labels=x, autopct='%1.f%%')
ax.set_title(questions[13], fontdict={'fontweight': 'bold'})
plt.show()

He was surprised to see that University courses were not the top answer. Maybe it is because data scientist is a relatively new job, so current data scientists did not have Machine Learning courses during their studies. The student decided to create a Coursera account because this platform is very popular with data scientists.

To finish, the student analyzed the yearly compensation of data scientists. He was French and wanted to work in France, so he decided to take into account only French data scientists, because salaries vary greatly by country. In France, the minimum annual gross salary for a full-time job is approximately €18,000, which is almost \$20,000, so he kept only the compensation ranges starting at or above this value.

In [None]:
dfFrenchDataScientists = dfDataScientists.query('Q3 == "France"')
x, y = dfFrenchDataScientists['Q10'].value_counts().index, dfFrenchDataScientists['Q10'].value_counts().values

# Create a new array in order to sort the axis by compensation
axis = list(map(lambda index: (int(index.split('-')[0].replace(',', '').replace('$', '').replace('>', '')), index), x))
axis = [(index, count) for index, count in zip(axis, y)]
axis.sort()

# Recreate x and y sorted with only compensations above $20,000
x = [row[0][1] for row in axis if row[0][0]>=20000]
y = [row[1] for row in axis if row[0][0]>=20000]

fig, ax = plt.subplots(figsize=(20, 7))
ax.set_title(questions[10] + '\n', fontdict={'fontweight': 'bold'})
plt.xticks(rotation=45)
ax = sns.barplot(x=x, y=y, ax=ax)

He saw that many French data scientists earned between \$40,000 and \$50,000, and that the number of respondents decreased as the compensation increased.
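
The sorting step in the cell above is needed because the compensation ranges are text labels: sorted as plain strings, a range starting at 100,000 would come before one starting at 20,000. A minimal sketch of the idea, assuming labels written in the same style as the survey's ranges (the helper name `lower_bound` and the sample labels are illustrative), is to extract the lower bound of each range as a number and sort on that:

```python
def lower_bound(label):
    """Extract the lower bound of a compensation range as an integer."""
    first_part = label.split('-')[0]
    for char in ('$', ',', '>'):
        first_part = first_part.replace(char, '')
    return int(first_part)  # int() ignores surrounding whitespace

# Sample labels, written in the style of the survey's ranges
labels = ['40,000-49,999', '> $500,000', '$0-999', '100,000-124,999', '20,000-24,999']

# Sort by numeric lower bound rather than alphabetically
ordered = sorted(labels, key=lower_bound)

# Keep only the ranges starting at or above $20,000
kept = [label for label in ordered if lower_bound(label) >= 20000]
print(kept)
```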

He wanted to go further in the analysis of compensation, so he decided to see whether it depended on the size of the company. He used the answers of all data scientists because the number of answers from French data scientists was too low to be representative.

In [None]:
# Group by compensations and company sizes
dfGrouped = dfDataScientists.groupby(['Q10', 'Q6']).size().unstack()
# Create new df with the salary ranges sorted and above $20,000.
# x still holds the sorted labels from the previous cell; selecting them with .loc
# replaces the row-by-row DataFrame.append, which was removed in pandas 2.0
newDf = dfGrouped.loc[x].copy()

# Sort the columns
cols = ['0-49 employees', '50-249 employees', '250-999 employees', '1000-9,999 employees', '> 10,000 employees']
newDf = newDf[cols]

# Create a new column with the total for each compensation
newDf['sum'] = newDf.sum(axis=1)
 
# Divide by the total in order to get the share of each company size for each compensation range
newDf = newDf.loc[:,cols].div(newDf['sum'], axis=0)

fig, ax = plt.subplots(figsize=(16, 8))
ax.set_title(questions[10] + '\n', fontdict={'fontweight': 'bold'})
newDf.plot(kind='bar', stacked=True, ax=ax)
plt.xticks(rotation=45)
plt.show()

He saw that compensation did not differ much with the size of the company, but as the salary increased, the share of employees of small companies tended to decrease.
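
The stacked bars above compare shares rather than raw counts, so that salary ranges with few respondents can be read on the same scale as the crowded ones. A minimal sketch of that row-wise normalization, on made-up counts that are not the survey results:

```python
import pandas as pd

# Made-up counts, for illustration only (not the survey results):
# rows are compensation ranges, columns are company sizes
counts = pd.DataFrame(
    {'0-49 employees': [30, 10], '> 10,000 employees': [10, 30]},
    index=['40,000-49,999', '100,000-124,999'],
)

# Divide each row by its own total so that every row sums to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares)
```

Passing `axis=0` to `div` aligns the row totals with the row labels, which is what makes each row, rather than each column, sum to 1.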

Through his analysis of the 2019 Kaggle ML & DS Survey, the student learned a lot about his future job. He felt more confident about launching his career in data science.

**THE END**