# Older people in the 2022 Kaggle ML and DS Survey
# Introduction
In this notebook, I analyze only a subset of the Kaggle survey. Every age group takes part in the Kaggle survey each year, but I focus on respondents aged 70 and above, whom I call "older people" or "older respondents" throughout. I explore the older respondents in the 2022 survey and compare their participation with the previous two years' surveys. The notebook also offers some insight into their gender, country, education, machine learning experience, current role, and salary.

![](https://feistysideoffifty.com/wp-content/uploads/2021/06/960x0.jpg)
image source: feistysideoffifty.com/2021/06/17/how-has-the-pandemic-affected-older-workers

In [1]:
import numpy as np # linear algebra
import pandas as pd

import os
os.listdir("../input")
!pip install chart_studio
# Standard plotly imports
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
# Using plotly + cufflinks in offline mode
import cufflinks
cufflinks.go_offline(connected=True)

init_notebook_mode(connected=True)
import warnings


Collecting chart_studio
  Downloading chart_studio-1.1.0-py3-none-any.whl (64 kB)
Installing collected packages: chart_studio
Successfully installed chart_studio-1.1.0


The Shapely GEOS version (3.9.1-CAPI-1.14.2) is incompatible with the GEOS version PyGEOS was compiled with (3.10.3-CAPI-1.16.1). Conversions between both will be slow.



In [2]:
# low_memory=False reads each file in a single pass, which avoids the mixed-dtype warnings
df_2020=pd.read_csv("../input/kaggle-survey-2020/kaggle_survey_2020_responses.csv", low_memory=False)
df_2021=pd.read_csv("../input/kaggle-survey-2021/kaggle_survey_2021_responses.csv", low_memory=False)
df_2022=pd.read_csv("../input/kaggle-survey-2022/kaggle_survey_2022_responses.csv", low_memory=False)
# Skip the first row as it holds the question text; copy so new columns can be added safely later
df_2020 = df_2020.iloc[1:].copy()
df_2021 = df_2021.iloc[1:].copy()
df_2022 = df_2022.iloc[1:].copy()



# 1. Age distribution of respondents in the 2022 survey

We want to see what percentage of respondents in the 2022 Kaggle ML and DS survey are older people, compared to the other age groups. Let's explore!

In [3]:
age22=df_2022['Q2'].value_counts(sort=True)
labels=age22.index
values=age22.values



fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='percent',
                             insidetextorientation='radial'
                            )])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(
    title_text="age distribution of respondents in 2022",legend=dict(orientation="h"))
fig.show()

**Key insights:**

* Very few respondents are aged 70 or above.
* Older people make up only 0.529% of all respondents in the 2022 survey.
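
The share quoted above can also be computed directly rather than read off the pie chart. The following is a minimal sketch on made-up answers; the helper name `age_share` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def age_share(ages: pd.Series, group: str = "70+") -> float:
    """Return the percentage of non-missing answers that equal `group`."""
    shares = ages.value_counts(normalize=True) * 100
    return float(shares.get(group, 0.0))

# Hypothetical toy answers, NOT the real survey data
toy_ages = pd.Series(["18-21"] * 6 + ["25-29"] * 3 + ["70+"])
toy_share = age_share(toy_ages)
print(f"{toy_share:.1f}% of the toy respondents are 70+")
```

In the notebook itself, the same helper could presumably be called as `age_share(df_2022['Q2'])`.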

# Older respondents over the years

Is the share of older respondents increasing over the years?

In [4]:
from plotly.subplots import make_subplots
older_people = ['70+']

df_older1= df_2022[df_2022['Q2'].isin(older_people )]
df_2022['compare']=["70+" if x in older_people else "other" for x in df_2022['Q2']]

df_older2= df_2021[df_2021['Q1'].isin(older_people )]
df_2021['compare']=["70+" if x in older_people else "other" for x in df_2021['Q1']]

df_older3= df_2020[df_2020['Q1'].isin(older_people )]
df_2020['compare']=["70+" if x in older_people else "other" for x in df_2020['Q1']]
labels=['other' ,'70+' ]
# Reindex the counts so they always line up with the order of the labels
age20=df_2020['compare'].value_counts().reindex(labels)
age21=df_2021['compare'].value_counts().reindex(labels)
age22=df_2022['compare'].value_counts().reindex(labels)


fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'},{'type':'domain'}]])
fig.add_trace(go.Pie(labels=labels, values=age22.values, name="2022"),
              1, 1)
fig.add_trace(go.Pie(labels=labels, values=age21.values, name="2021"),
              1, 2)
fig.add_trace(go.Pie(labels=labels, values=age20.values, name="2020"),
              1, 3)
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.data[1].marker.line.width = 1
fig.data[1].marker.line.color = "black"
fig.data[2].marker.line.width = 1
fig.data[2].marker.line.color = "black"
fig.update_layout(
    title_text="Age distribution over the years",legend=dict(orientation="h"),
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='2022', x=0.09, y=0.5, font_size=20, showarrow=False),
                 dict(text='2021', x=0.5, y=0.5, font_size=20, showarrow=False),
                 dict(text='2020', x=0.89, y=0.5, font_size=20, showarrow=False)])
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.5, hoverinfo="label+percent+name")
fig.show()

**Key insights:**

Compared with the previous two years, the share of older people has increased slightly year by year, but the change is very small.
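
Because the differences between the three donuts are hard to see, a small table of counts and percentages may make the trend easier to judge. This is a sketch on invented numbers; `older_share_by_year` is a hypothetical helper and the toy counts are not the survey's.

```python
import pandas as pd

def older_share_by_year(age_columns: dict, group: str = "70+") -> pd.DataFrame:
    """Build one row per year with the count and percentage of `group`."""
    rows = []
    for year, ages in age_columns.items():
        older = int((ages == group).sum())
        total = int(ages.notna().sum())
        rows.append({"year": year, "older": older, "total": total,
                     "share_pct": round(100 * older / total, 3)})
    return pd.DataFrame(rows).set_index("year")

# Hypothetical toy answers, NOT the real survey data
toy_years = {
    "2020": pd.Series(["70+"] * 1 + ["other"] * 199),
    "2021": pd.Series(["70+"] * 2 + ["other"] * 198),
    "2022": pd.Series(["70+"] * 3 + ["other"] * 197),
}
toy_table = older_share_by_year(toy_years)
print(toy_table)
```

With the real data, the dictionary would presumably hold `df_2020['Q1']`, `df_2021['Q1']`, and `df_2022['Q2']`, the age columns used in the cell above.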

# 2. Gender of older people in 2022

Does each gender participate equally?

In [5]:
# Keep only respondents aged 70+ from each year; copy so later changes do not touch the full survey
aged_70_20=df_2020[df_2020['Q1']=='70+'].copy()
aged_70_21=df_2021[df_2021['Q1']=='70+'].copy()
aged_70_22=df_2022[df_2022['Q2']=='70+'].copy()
colors = ['deepskyblue', 'lightcyan', 'cyan', 'royalblue']
gender22=aged_70_22['Q3'].value_counts(sort=True)
labels=gender22.index
values=gender22.values
fig = go.Figure(data=[go.Pie(labels=labels, values=values, textinfo='percent',
                             insidetextorientation='radial',marker=dict(colors=colors)
                            )])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(
    title_text="Gender distribution of older respondents in 2022",legend=dict(orientation="h"))
fig.show()



**Key insights:**

* There is a huge gap between the number of men and women among older respondents.
* 88.2% are men and only 7.09% are women.

# Gender distribution over the years

In [6]:
colors2 = ['deepskyblue', 'lightcyan', 'cyan', 'royalblue']
gender20=aged_70_20['Q2'].value_counts(sort=True)
gender21=aged_70_21['Q2'].value_counts(sort=True)
gender22=aged_70_22['Q3'].value_counts(sort=True)
# Take the labels from each year's own counts so every slice is named after its actual category
fig = make_subplots(rows=1, cols=3, specs=[[{'type':'domain'}, {'type':'domain'},{'type':'domain'}]])
fig.add_trace(go.Pie(labels=gender20.index, values=gender20.values, name="2020",marker=dict(colors=colors2)),
              1, 1)
fig.add_trace(go.Pie(labels=gender21.index, values=gender21.values, name="2021",marker=dict(colors=colors2)),
              1, 2)
fig.add_trace(go.Pie(labels=gender22.index, values=gender22.values, name="2022",marker=dict(colors=colors2)),
              1, 3)
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.data[1].marker.line.width = 1
fig.data[1].marker.line.color = "black"
fig.data[2].marker.line.width = 1
fig.data[2].marker.line.color = "black"
fig.update_layout(
    title_text="the older people gender distribution over the years",legend=dict(orientation="h"),
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='2020', x=0.10, y=0.5, font_size=20, showarrow=False),
                 dict(text='2021', x=0.5, y=0.5, font_size=20, showarrow=False),
                 dict(text='2022', x=0.89, y=0.5, font_size=20, showarrow=False)])
# Use `hole` to create a donut-like pie chart
fig.update_traces(hole=.6, hoverinfo="label+percent+name")

fig.show()

**Key insights:**

* The share of women respondents is gradually increasing over the years.
* Men made up 94.7% of older respondents in the 2020 survey; their share declined slightly in 2021 and rose again in 2022.
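
Since the answer options differ between survey years, it can help to align the yearly gender shares in one table before comparing them. The sketch below does this on toy data; `gender_share_table` and the sample answers are assumptions made for illustration.

```python
import pandas as pd

def gender_share_table(gender_columns: dict) -> pd.DataFrame:
    """Return the percentage of each answer per year, aligned on the answer label."""
    shares = {year: col.value_counts(normalize=True) * 100
              for year, col in gender_columns.items()}
    # Answers that do not occur in a given year are filled with 0
    return pd.concat(shares, axis=1).fillna(0).round(1)

# Hypothetical toy answers, NOT the real survey data
toy_gender = {
    "2020": pd.Series(["Man"] * 9 + ["Woman"]),
    "2022": pd.Series(["Man"] * 8 + ["Woman"] + ["Prefer not to say"]),
}
toy_gender_table = gender_share_table(toy_gender)
print(toy_gender_table)
```

Aligning on the label avoids pairing a count with the wrong category when one year has an option that another year lacks.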

# 3. Nationality of older people

Did older respondents from almost all countries participate in this survey?

In [7]:
import pycountry  # needed by get_name below

# Shorten long country names so the choropleth can match them
aged_70_22['Q4'] = aged_70_22['Q4'].replace({
    'United States of America': 'United States',
    'United Kingdom of Great Britain and Northern Ireland': 'United Kingdom',
    'I do not wish to disclose my location': "don't show location",
})

def get_name(code):
    '''
    Translate an ISO alpha-3 code to the name of the country.
    Values that are not alpha-3 codes (such as the full country
    names stored in Q4) are returned unchanged.
    '''
    try:
        name = pycountry.countries.get(alpha_3=code).name
    except (AttributeError, LookupError):
        name = code
    return name

# One row per country with the number of older respondents
country_counts = aged_70_22['Q4'].value_counts()
country_number = pd.DataFrame({'number': country_counts.values,
                               'country': country_counts.index})
country_number['country'] = country_number['country'].apply(get_name)

choropleth_map = go.Figure(
    data = {
        'type':'choropleth',
        'locations':country_number['country'],
        'locationmode':'country names',
        'colorscale':'gnbu',
        'z':country_number['number'],
        'colorbar':{'title':'Number of respondents'},
        'marker': {
            'line': {
                'width':0.6 ,'color':'black'
            }
        }
    },     
    layout = { 'title_text':'Nationality of older people in 2022'  ,
      'geo':{
          'scope':'world', 
          'showframe':False,
          'showcoastlines': False,
          'projection_type':'equirectangular' 
      }  
    })

choropleth_map

**Key insights:**

* Most of the older respondents are from the United States and India.
* The older people who participated in this survey come from only a few countries.

# Top 10 countries with older people respondents

In [8]:

top_10country=aged_70_22['Q4'].value_counts(sort=True)[:10]

x=top_10country.index
y=top_10country.values
fig = go.Figure([go.Bar(x=x, y=y ,text=y,
            width=0.5,
            textposition='auto',
            marker=dict(color='deepskyblue'))])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(yaxis=dict(title='Number of older respondents'),width=800,height=600,
                  title='Top 10 countries with older people respondents',
                  xaxis=dict(title='country'))
fig.show()

**Key insight**
* Nearly half of the older respondents are from the United States.

# Gender distribution by country

Only a few women participated in this survey; let's check which countries they are from.

In [9]:
# Remove respondents whose country is 'Other' or undisclosed
aged_70_22 = aged_70_22[~aged_70_22['Q4'].isin(['Other', "don't show location"])]
# Count respondents per gender (rows) and country (columns)
df_country = (
    aged_70_22.groupby("Q3")["Q4"]
    .value_counts()
    .unstack()
)



ax = df_country.iplot(kind="bar",xTitle="GENDER",yTitle='count',title="GENDER DISTRIBUTION BY COUNTRY",width=800)


**Key insight**
* In all countries, most of the older respondents are men.
* Women respondents are from Australia, Ukraine, and the United States.
* Most of the women are from the United States.
* The only older respondent from Ukraine is a woman.
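
The countries with women respondents can also be listed programmatically instead of being read off the bar chart. The sketch below uses `pd.crosstab`, which produces a gender-by-country count table with zeros for missing pairs; the toy rows are invented and only reuse the column names `Q3` and `Q4` from the notebook.

```python
import pandas as pd

# Hypothetical toy answers, NOT the real survey data
toy = pd.DataFrame({
    "Q3": ["Man", "Man", "Woman", "Man", "Woman", "Man"],
    "Q4": ["United States", "India", "United States",
           "United States", "Ukraine", "India"],
})

# Rows are countries, columns are genders; missing pairs become 0 instead of NaN
gender_by_country = pd.crosstab(toy["Q4"], toy["Q3"])

# Countries in which at least one woman responded
countries_with_women = gender_by_country.index[gender_by_country["Woman"] > 0].tolist()
print(gender_by_country)
print(countries_with_women)
```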

# 4. Education qualification of older people

In [10]:
education=round(aged_70_22['Q8'].value_counts(normalize=True)*100)

x=education.index
y=education.values
fig = go.Figure([go.Bar(x=x, y=y ,text=y , width=0.4,
            textposition='auto',
            marker=dict(color='deepskyblue'))])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(yaxis=dict(title='percentage of Respondents'),width=1000,height=600,
                  title='education qualification of older people',
                  xaxis=dict(title='education'))
fig.show()


**Key insights:**
* The largest share of older respondents holds a master's degree, followed closely by those with a doctoral degree.

# Education qualification in top six countries

In [11]:
df_country = (
    aged_70_22.groupby("Q8")["Q4"]
    .value_counts()
    .unstack()
)

top6 = df_country.sum(axis=0).sort_values(ascending=False).index[:6].tolist()


ax = df_country[top6].T.iplot(kind="bar", xTitle="COUNTRY",yTitle='counts',title="Education qualification in the top six countries")

**Key insight**

* Among the top six countries, master's degrees are the most common qualification of older respondents in the United States, Canada, and Belgium, while doctoral degrees are the most common in India and Australia.

# 5. ML experience of older people

In [12]:
ml_exp=round(aged_70_22['Q16'].value_counts(normalize=True)*100)

x=ml_exp.index
y=ml_exp.values
fig = go.Figure([go.Bar(x=x, y=y ,text=y , width=0.4,
            textposition='auto',
            marker=dict(color='deepskyblue'))])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(yaxis=dict(title='percentage of Respondents'),width=800,height=600,
                  title='ML experience of older people',
                  xaxis=dict(title='ML experience'))
fig.show()

**Key insight**
* The largest group of older respondents has 10 to 20 years of machine learning experience, followed closely by those with under one year of experience and those with none.

# 6. Current Role of older respondents

In [13]:
role=round(aged_70_22['Q23'].value_counts(normalize=True)*100)

x=role.index
y=role.values
fig = go.Figure([go.Bar(x=x, y=y ,text=y , width=0.5,
            textposition='auto',
            marker=dict(color='deepskyblue'))])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(yaxis=dict(title='percentage of Respondents'),width=800,height=600,
                  title='role of older people',
                  xaxis=dict(title='role'))
fig.show()

**Key insight**
* Most of the older respondents are not currently employed; among those who are, the most common role is teacher/professor, closely followed by research scientist and data scientist.

# Current Role of older people in top six countries

In [14]:
df_country = (
    aged_70_22.groupby("Q23")["Q4"]
    .value_counts()
    .unstack()
)

top6 = df_country.sum(axis=0).sort_values(ascending=False).index[:6].tolist()


ax = df_country[top6].T.iplot(kind="bar",xTitle="COUNTRY",yTitle='COUNTS',title="ROLE OF OLDER RESPONDENTS IN TOP SIX COUNTRIES")

**Key insight**
* The United States has the largest number of data scientists who took part in the survey, followed by Brazil.
* The most common job role of older people in this survey is teacher/professor.
* Canada, Spain, and Brazil have no unemployed older respondents.
* The United States, India, and Belgium do have unemployed older respondents.


# Current Role and gender

In [15]:
df_country = (
    aged_70_22.groupby("Q3")["Q23"]
    .value_counts()
    .unstack()
)




ax = df_country.iplot(kind="bar",xTitle="gender",yTitle='COUNTS',title="gender and role")

**Key insight**

* Every role is dominated by men; there are almost no women.

# Data Scientists over the years

Is the number of older respondents whose current role is data scientist increasing over the years?

In [16]:
a=aged_70_20[aged_70_20['Q5']=='Data Scientist']

b=aged_70_21[aged_70_21['Q5']=='Data Scientist']
c=aged_70_22[aged_70_22['Q23']=='Data Scientist']

data_scientist = pd.DataFrame(data = [len(a),len(b),len(c)],
                          columns = ['Number of data scientist'], index = ['2020','2021','2022'])
data_scientist.index.names = ['Year of Survey']


x = data_scientist['Number of data scientist'].index
y = data_scientist['Number of data scientist'].values


# Use textposition='auto' for direct text
fig = go.Figure(data=[go.Bar(
            x=['Year 2020','Year 2021','Year 2022'],
            y=y,
            text=y,
            width=0.3,
            textposition='auto',
            marker=dict(color='deepskyblue')
 )])
fig.update_layout(yaxis=dict(title='Number of Respondents'),width=800,height=600,
                  title='DATA SCIENTISTS OVER THE YEARS',
                  xaxis=dict(title='Year'))
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.show()

**Key insight**
* The number of older data scientists taking part in the Kaggle surveys is decreasing year by year.

# Data Scientist respondents by country in 2022 survey

In [17]:
c=aged_70_22[aged_70_22['Q23']=='Data Scientist']
countrywise=c['Q4'].value_counts()
x = countrywise.index
y = countrywise.values


# Use textposition='auto' for direct text
fig = go.Figure(data=[go.Bar(
            x=x,
            y=y,
            text=y,
            width=0.4,
            textposition='auto',
            marker=dict(color='deepskyblue')
 )])
fig.update_layout(yaxis=dict(title='Number of Respondents'),width=800,height=600,
                  title='DATA SCIENTISTS BY COUNTRY',
                  xaxis=dict(title='Country'))
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.show()

**Key insight**

* There are only eight data scientists among all older respondents.
* The United States has three data scientists, followed by Brazil.

# Machine learning experience of data scientists by country in 2022 survey

In [18]:
a=aged_70_22[aged_70_22['Q23']=='Data Scientist']
df_country = (
    a.groupby("Q4")["Q16"]
    .value_counts()
    .unstack()
)




ax = df_country.iplot(kind="bar",xTitle="country",yTitle='count',title="ML experience of data scientist by country")

**Key insight**
* In the United States, two data scientists are highly experienced in machine learning (10-20 years), and Australia has one with the same experience.

# 7. Salary of older people

In [19]:
salary=round(aged_70_22['Q29'].value_counts(normalize=True)*100)

x=salary.index
y=salary.values
fig = go.Figure([go.Bar(x=x, y=y ,text=y , width=0.4,
            textposition='auto',
            marker=dict(color='deepskyblue'))])
fig.data[0].marker.line.width = 1
fig.data[0].marker.line.color = "black"
fig.update_layout(yaxis=dict(title='percentage of Respondents'),width=1000,height=600,
                  title='Salary of older people',
                  xaxis=dict(title='salary in USD'))
fig.show()

**Key insight**
* The most common salary range among older respondents is below 1,000 USD.
* The highest reported salary range is between 300,000 USD and 499,999 USD.
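
Note that `value_counts` orders the salary brackets by frequency, not by amount, so finding the highest bracket means sorting the labels by their lower bound. The following sketch shows one way to do that; `bracket_lower_bound` is a hypothetical helper, and the label format (for example `$0-999` or `300,000-499,999`) is an assumption about how the brackets are written.

```python
import re
import pandas as pd

def bracket_lower_bound(label: str) -> int:
    """Return the first number in a salary bracket label as an integer."""
    match = re.search(r"\d[\d,]*", label)
    if match is None:
        raise ValueError(f"no number found in bracket label: {label!r}")
    return int(match.group().replace(",", ""))

# Hypothetical toy percentages, NOT the real survey data
toy_salary = pd.Series({"$0-999": 50.0, "300,000-499,999": 10.0, "10,000-14,999": 40.0})

# Reorder the brackets from lowest to highest salary
ordered_salary = toy_salary.loc[sorted(toy_salary.index, key=bracket_lower_bound)]
print(ordered_salary)
```

Plotting the reordered series would put the x axis in salary order, which makes the lowest and highest brackets easier to spot.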

# Data scientist salaries by country in 2022 survey

Which countries pay higher salaries to older respondents whose current role is data scientist?

In [20]:
a=aged_70_22[aged_70_22['Q23']=='Data Scientist']
df_country = (
    a.groupby("Q4")["Q29"]
    .value_counts()
    .unstack()
)




ax = df_country.iplot(kind="bar",xTitle="COUNTRY",yTitle='counts',title="Data scientist salaries by country")

**Key insight**
* The United States and Australia pay older data scientists more than the other countries do, while the lowest salaries are paid in India.

# Key takeaways
Participation of older people is very low and has not improved much over the years. The number of women is very low compared to the number of men. Almost half of the older respondents are from the United States. Most of them are not currently employed. Meanwhile, the number of older respondents whose role is data scientist is decreasing year by year, and the reason behind this deserves a closer look.

**References**
* https://www.bcli.org/older-adult-older-person/
* https://www.bigdataflare.com/data-visualization-in-python/
* https://plotly.com/python/bar-charts/
* https://plotly.com/python/pie-charts/
* https://plotly.com/python/choropleth-maps/
