# Exploratory Data Analysis of 2021 Kaggle Survey
## About Data Science


##### Competition information:

This survey data EDA provides an overview of the industry on an aggregate scale, but it also leaves us wanting to know more about the many specific communities comprised within the survey. 
This analysis will help to tell the diverse stories of data scientists from around the world.

The challenge objective: tell a data story about a subset of the data science community represented in this survey, through a combination of both narrative text and data exploration. A “story” could be defined any number of ways, and that’s deliberate. The challenge is to deeply explore (through data) the impact, priorities, or concerns of a specific group of data science and machine learning practitioners. That group can be defined in the macro (for example: anyone who does most of their coding in Python) or the micro (for example: female data science students studying machine learning in masters programs). This is an opportunity to be creative and tell the story of a community you identify with or are passionate about!

##### Submissions will be evaluated on the following:

- Composition - Is there a clear narrative thread to the story that’s articulated and supported by data? The subject should be well defined, well researched, and well supported through the use of data and visualizations.
- Originality - Does the reader learn something new through this submission? Or is the reader challenged to think about something in a new way? A great entry will be informative, thought provoking, and fresh all at the same time.
- Documentation - Are your code, notebook, and additional data sources well documented so a reader can understand what you did? Are your sources clearly cited? A high-quality analysis should be concise and clear at each step so the rationale is easy to follow and the process is reproducible.

Data source: Kaggle

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
df = pd.read_csv('../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv')
# The first row holds the question text rather than a response, so drop it
df = df.iloc[1:]
df.head()
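
Dropping the first row discards the question text, which is handy later for labelling plots. A minimal sketch of keeping it for lookup, using a tiny made-up frame (`raw`, `questions`, and `responses` are illustrative names, and the toy values are assumptions rather than rows from the survey file):

```python
import pandas as pd

# Toy stand-in for the raw CSV: row 0 holds question text, later rows hold answers.
raw = pd.DataFrame({
    "Q1": ["What is your age (# years)?", "25-29", "18-21"],
    "Q2": ["What is your gender?", "Man", "Woman"],
})

# Keep the question text as a column -> question lookup before dropping it.
questions = raw.iloc[0].to_dict()

# Drop the question row and renumber the responses from 0.
responses = raw.iloc[1:].reset_index(drop=True)

print(questions["Q2"])
print(responses.shape)
```

With the real file, the same two steps would be applied to the frame returned by `pd.read_csv` before any grouping.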

In [3]:
# ML : Machine Learning, CV: Computer Vision, NLP: Natural language processing, DS: Data Science, CC: Cloud Computing, BD: Big Data
cols = [
    'times','age','gender','country','education level','job role','years of being programmer',
    'using language','recommend language',
    'IDEs','notebook products','computer platform',
    'specialized hardware','num of TPU use','visualization libraries', 'years of ML use','ML frameworks',
    'ML algorithms','CV methods', 'NLP methods', 
    'industry','company size','num of DS employees','ML usage level','important work of DS','compensation',
    'spent on ML or cloud for 5years', 'CC platform','best experience in CC','CC products','storage products',
    'ML products','BD products', 'best BD product','BI tools','best BI tool','automated ML tools','best'
    
]

In [4]:
print(df.shape[0],'rows',df.shape[1],'columns')

In [5]:
df[df.columns[9]].unique()

In [6]:
df.isnull().sum()

In [7]:
df.describe().T

## Education Level 

In [8]:
edu_level = pd.DataFrame(df.groupby("Q4",dropna=False).size()).reset_index()
edu_level.columns = ['Education Level', 'Counts']
edu_level = edu_level.sort_values('Counts',ascending=False).reset_index(drop=True)
edu_level = edu_level[(edu_level["Counts"]>1)]
edu_level.loc[3,'Education Level'] = 'College/Univ w/o degree'
edu_level.loc[4,'Education Level'] = 'No Answer'
edu_level.loc[5,'Education Level'] = 'High school'
print('{Distribution of Education Level}')
print(edu_level)
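
Renaming by row position (`loc[3]`, `loc[4]`, `loc[5]`) only works while the sort order stays the same. A hedged alternative is to rename by label with a mapping. In the sketch below the long label strings are assumed wording (check `df["Q4"].unique()` for the exact text), the counts are made up, and `short_labels` is a hypothetical name:

```python
import pandas as pd

# Toy counts; the long labels are assumed wording, not copied from the survey file.
edu_level = pd.DataFrame({
    "Education Level": [
        "Master's degree",
        "Some college/university study without earning a bachelor's degree",
        "I prefer not to answer",
        "No formal education past high school",
    ],
    "Counts": [10, 4, 2, 1],
})

# Hypothetical mapping from long survey labels to short display labels.
short_labels = {
    "Some college/university study without earning a bachelor's degree": "College/Univ w/o degree",
    "I prefer not to answer": "No Answer",
    "No formal education past high school": "High school",
}

# replace() swaps exact matches and leaves every other label untouched.
edu_level["Education Level"] = edu_level["Education Level"].replace(short_labels)
print(edu_level)
```

Because the mapping is keyed on the label text, the short names stay attached to the right rows even if the counts change and the rows reorder.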

In [9]:
sns.set(style='darkgrid')
# Set the resolution before drawing so that it applies to this figure
plt.rcParams['figure.dpi'] = 300

ax = sns.barplot(x='Counts', y='Education Level',
                  data = edu_level,
                   palette='Set3')

plt.show()

In [10]:
import plotly.express as px
fig1 = px.pie(edu_level, values='Counts', names='Education Level',
              title='Education Level in Data Science Field',
              color_discrete_sequence=px.colors.sequential.matter)

fig1.show()

In [11]:
edu_gender = df[["Q2","Q4"]]

edu_gender.head()

### Education level by Gender

In [12]:
# education level by gender
sns.set(style='darkgrid')
ax = sns.countplot(y = edu_gender['Q4'],
                  data = edu_gender,hue='Q2',
                   palette='Set3')

In [115]:
gender= pd.DataFrame(edu_gender.groupby("Q2",dropna=False).size()).reset_index()
gender.columns = ['Gender','Counts']
gender

In [119]:
### Data set by gender counts
gender= pd.DataFrame(edu_gender.groupby("Q2",dropna=False).size()).reset_index()
gender.columns = ['Gender','Counts']

### Graph
fig2 =px.pie(gender, values = "Counts",names="Gender",
             title='Gender in Data Science Field',
            color_discrete_sequence=px.colors.sequential.Plasma_r,
            hole = 0.4)
fig2.update_traces( textfont_size=20)
fig2.show()

Most respondents are male, and a master's degree is the most common education level.
The counts for bachelor's and master's degrees differ only slightly. Bachelor's degrees may well become the most common degree in data science, because many students are entering the field.
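
To put a number on how close two degree counts are, the shares can be compared in percentage points. A minimal sketch on made-up answers (`q4` stands in for `df["Q4"]`; the values are illustrative, not survey results):

```python
import pandas as pd

# Toy answers standing in for df["Q4"]; the counts are made up.
q4 = pd.Series(
    ["Master's degree"] * 4 + ["Bachelor's degree"] * 3 + ["Doctoral degree"] * 1
)

# normalize=True returns fractions instead of raw counts; scale to percent.
share = q4.value_counts(normalize=True).mul(100).round(1)
print(share)

# Difference between the two largest groups, in percentage points.
gap = share["Master's degree"] - share["Bachelor's degree"]
print(f"Gap: {gap:.1f} percentage points")
```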

## Age Distribution

In [14]:
### Data Set for the range of age
age = pd.DataFrame(df.groupby('Q1',dropna=False).size()).reset_index()
age.columns=['Age Range','Counts']

### Plot
sns.set(style='darkgrid')

ax2= sns.barplot(x=age['Counts'],y=age['Age Range'], data = age,palette='Set2')

plt.rcParams['figure.dpi'] = 300
plt.show()

#### As shown above, many people in data science are between 18 and 29 years old.

In [15]:
### Data Frame for 'how many of women and men are in Data science by the range of age'
age_gender=pd.DataFrame(df.groupby(['Q1','Q2'],dropna=False).size())
age_gender = age_gender.reset_index()
age_gender.columns=['Age Range','Gender','Counts']

### Graph
sns.set(style='darkgrid')

plt.rcParams['figure.dpi'] = 300
plt.rc('xtick',labelsize=8) # fontsize of the tick labels
plt.rc('ytick',labelsize=8)
#graph
ax2= sns.barplot(x=age_gender['Age Range'],y=age_gender['Counts'],hue=age_gender['Gender'],data = age_gender,
                   palette='Set3')
#legend
ax2.legend(loc = 'upper right', title_fontsize=6, title='Gender',prop={'size':6})
#labels
ax2.set_xlabel('Age Range',fontsize=8)
ax2.set_ylabel('Counts',fontsize=8)

plt.show()

Most respondents are young and male.
The counts follow the same pattern across the age ranges for each gender.
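
Raw counts make it hard to compare groups of very different sizes. One way to check whether the age pattern really is similar is to normalize within each gender. A minimal sketch with made-up rows (`toy` stands in for `df[["Q1", "Q2"]]`):

```python
import pandas as pd

# Toy rows standing in for df[["Q1", "Q2"]]; the values are made up.
toy = pd.DataFrame({
    "Q1": ["18-21", "18-21", "22-24", "22-24", "22-24", "25-29", "18-21", "22-24"],
    "Q2": ["Man", "Man", "Man", "Man", "Woman", "Man", "Woman", "Woman"],
})

# normalize="columns" makes each gender column sum to 1,
# so the age distributions are comparable despite different group sizes.
dist = pd.crosstab(toy["Q1"], toy["Q2"], normalize="columns")
print(dist.round(2))
```

If the columns of `dist` look alike, the age distribution is similar across genders regardless of how many respondents each group has.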
### So, how many years of coding experience do they have?

In [16]:
### Data Set for the experienced years
experience = pd.DataFrame(df.groupby('Q6',dropna=False).size()).reset_index()
experience.columns = ['Years','Counts']

experience

In [120]:
### Graph
fig3 = px.pie(experience, values='Counts', names='Years', title='Experience', hole=0.5,
              color_discrete_sequence=px.colors.qualitative.Bold)
fig3.update_traces(title='Years of Experience',
                 title_font = dict(size = 20),
                 textfont_size = 15)
fig3.show()

A large share of respondents are people who have only recently started coding.

## Languages
What language do you recommend?

In [123]:
### Data set for the recommended programming language (Q8)
lng_data = pd.DataFrame(df.groupby('Q8',dropna=False).size()).reset_index()
lng_data.columns = ['Language','Counts']

### Graph
fig4 = px.pie(lng_data, values='Counts', names='Language',
              title='Recommended Language in Data Science Field',
              color_discrete_sequence=px.colors.sequential.matter)
fig4.update_traces(textfont_size = 15)
fig4.show()

Python takes the largest share of the language recommendations in data science.
It is fair to say that Python is the go-to language in the field!

## Visualization Libraries People Use Most

Visualization is one of the most powerful tools for explaining data.
The more essential data science becomes, the more important visualization skills become.

In [213]:
### Data set of the usage of visualization libraries
vis = df[["Q14_Part_1", "Q14_Part_2","Q14_Part_3", "Q14_Part_4","Q14_Part_5","Q14_Part_6",
          "Q14_Part_7","Q14_Part_8","Q14_Part_9", "Q14_Part_10", "Q14_Part_11", "Q14_OTHER"]]
# stack() flattens the multi-select columns into one long list of chosen libraries
vis_df = pd.DataFrame(vis.stack().tolist())
vis_df.columns = ['library']

vis_df = pd.DataFrame(vis_df.groupby('library', dropna = False)
                      .size()).reset_index()
vis_df.columns = ['library','counts']
# drop=True avoids keeping the old index as an extra column
vis_df = vis_df.sort_values('counts',ascending=False).reset_index(drop=True)

### Plot
sns.set(style='darkgrid')

ax3 = sns.barplot(x='counts', y='library', data=vis_df, palette='Set2')
plt.show()
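
A multi-select question such as Q14 is stored as one column per option, with the option text where it was chosen and a missing value elsewhere. A minimal sketch of how stacking turns that layout into counts, on made-up rows (`toy` is an illustrative frame, not survey data; the explicit `dropna()` is added so the result does not depend on how the installed pandas version treats missing values in `stack()`):

```python
import numpy as np
import pandas as pd

# Toy multi-select layout: one column per option, NaN where it was not chosen.
toy = pd.DataFrame({
    "Q14_Part_1": ["Matplotlib", "Matplotlib", np.nan],
    "Q14_Part_2": ["Seaborn", np.nan, "Seaborn"],
    "Q14_Part_3": [np.nan, np.nan, "Plotly / Plotly Express"],
})

# Flatten all option columns into one long Series, drop the blanks,
# then count how often each library was selected.
counts = toy.stack().dropna().value_counts()
print(counts)
```

Each respondent can contribute several entries, so the counts sum to the number of selections rather than the number of respondents.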

## Years of Experience in Machine Learning

In [215]:
exp_ml = pd.DataFrame(df.Q15.value_counts()).reset_index()
exp_ml.columns = ['Years','Counts']


In [None]:
### Data set of framework
framework = df[["Q16_Part_1", "Q16_Part_2", "Q16_Part_3", "Q16_Part_4", "Q16_Part_5", "Q16_Part_6", "Q16_Part_7", "Q16_Part_8", "Q16_Part_9",
                "Q16_Part_10", "Q16_Part_11", "Q16_Part_12", "Q16_Part_13", "Q16_Part_14", "Q16_Part_15", "Q16_Part_16", 
                "Q16_Part_17", "Q16_OTHER"]]
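# Flatten the multi-select columns, then count how often each framework was chosen
fw_ml = pd.DataFrame(framework.stack().tolist())
fw_ml.columns = ['Framework']
fw_ml = pd.DataFrame(fw_ml.groupby('Framework').size()).reset_index()
fw_ml.columns = ['Framework','Counts']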


## Income Level compared to Education Level