This is a guided project on Python using Jupyter Notebook.

In [36]:
import csv
# Use the csv module to read the CSV file
# (newline='' is what the csv documentation recommends when opening files)
with open('kaggle2021-short.csv', newline='') as f:
    reader = csv.reader(f, delimiter=",")
    kaggle_data = list(reader)
# Here we separate the column names and actual
# data into two lists
column_names = kaggle_data[0]
survey_responses = kaggle_data[1:]

In [37]:
print(survey_responses[0:5])

[['6.1', 'TRUE', 'FALSE', 'TRUE', 'Scikit-learn', '124267'], ['12.3', 'TRUE', 'TRUE', 'TRUE', 'Scikit-learn', '236889'], ['2.2', 'TRUE', 'FALSE', 'FALSE', 'None', '74321'], ['2.7', 'FALSE', 'FALSE', 'TRUE', 'None', '62593'], ['1.2', 'TRUE', 'FALSE', 'FALSE', 'Scikit-learn', '36288']]
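
The header/data split can also be done while reading, by pulling the first row off the reader before collecting the rest. The sketch below is a minimal illustration that uses an in-memory string instead of the real file; the header names in it are hypothetical placeholders, since the actual column names are not printed in this notebook.

```python
import csv
import io

# Stand-in for the CSV file. The header names are hypothetical placeholders;
# the two data rows are copied from the output above.
sample_csv = io.StringIO(
    "experience,python,r,sql,library,compensation\n"
    "6.1,TRUE,FALSE,TRUE,Scikit-learn,124267\n"
    "2.2,TRUE,FALSE,FALSE,None,74321\n"
)

reader = csv.reader(sample_csv)
column_names = next(reader)       # the first row is the header
survey_responses = list(reader)   # the remaining rows are the data

print(column_names)
print(survey_responses)
```

Every value still comes back as a string at this point, which is why the conversion step is needed.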


Looking at the data, we can see that every value is stored as a string, so the data types do not match what each column represents. Before any analysis, each column needs to be converted to an appropriate type.

In [38]:
# Iterating through each row and assigning proper 
#data types for each column
for row in survey_responses:
    row[0] = float(row[0])
    row[1] = (row[1] == "TRUE")
    row[2] = (row[2] == "TRUE")
    row[3] = (row[3] == "TRUE")
    if row[4] == 'None':
        row[4] = None
    row[5] = int(row[5])
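
The same conversions could be wrapped in a helper that returns a new, typed row. This is only a sketch: `clean_row` is a hypothetical name that does not appear in the original notebook, and the meaning given to each column (experience, three language flags, a library column, compensation) is an assumption based on how the columns are used in the later cells.

```python
def clean_row(raw_row):
    """Return a typed copy of one raw survey row (strings in, typed values out).

    Assumes the six-column order seen in the printed rows:
    experience, Python flag, R flag, SQL flag, library, compensation.
    """
    experience, python, r, sql, library, compensation = raw_row
    return [
        float(experience),
        python == "TRUE",
        r == "TRUE",
        sql == "TRUE",
        None if library == "None" else library,
        int(compensation),
    ]


# Example with the first raw row printed earlier
sample = ['6.1', 'TRUE', 'FALSE', 'TRUE', 'Scikit-learn', '124267']
print(clean_row(sample))
```

One reason to prefer a copy: re-running the in-place loop on rows that are already converted would silently turn every language flag into False, because `True == "TRUE"` is False.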

Now we'll count how many respondents use each language and what proportion of the dataset they represent.

In [39]:
#Initializing count of different users
n_py = 0
n_r = 0
n_sql = 0
l = len(survey_responses)

#finding the number of users for each language
for row in survey_responses:
    if row[1]:
        n_py += 1
    if row[2]:
        n_r += 1
    if row[3]:
        n_sql += 1
        
# Finding the proportion of users of each 
#language within the dataset

prop_py = n_py/l
prop_r = n_r/l
prop_sql = n_sql/l

#Printing the result
print(f"Number of Python users: {n_py} and proportion is: {prop_py:.2f}")
print(f"Number of R users: {n_r} and proportion is: {prop_r:.2f}")
print(f"Number of SQL users: {n_sql} and proportion is: {prop_sql:.2f}")

Number of Python users: 21860 and proportion is: 0.84
Number of R users: 5335 and proportion is: 0.21
Number of SQL users: 10757 and proportion is: 0.41


* Python has the most users, more than R and SQL combined (21860 versus 16092), so it's safe to say Python is the preferred language among the data professionals in this dataset.
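
The counting loop can be written more compactly, because Python treats True as 1 and False as 0 when summing. The sketch below is one possible variant, not the notebook's own code; it runs on the five cleaned rows shown in the output rather than the full dataset, and `language_columns` is a name introduced here for illustration.

```python
# Five cleaned rows copied from the notebook output, used as stand-in data
rows = [
    [6.1, True, False, True, 'Scikit-learn', 124267],
    [12.3, True, True, True, 'Scikit-learn', 236889],
    [2.2, True, False, False, None, 74321],
    [2.7, False, False, True, None, 62593],
    [1.2, True, False, False, 'Scikit-learn', 36288],
]

# Column index of each language flag, matching the indices used in the loop above
language_columns = {"Python": 1, "R": 2, "SQL": 3}

total = len(rows)
counts = {}
for language, index in language_columns.items():
    # True counts as 1 and False as 0, so sum() gives the number of users
    counts[language] = sum(row[index] for row in rows)

for language, users in counts.items():
    print(f"Number of {language} users: {users} and proportion is: {users / total:.2f}")
```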

In [40]:
print(survey_responses[0:5])

[[6.1, True, False, True, 'Scikit-learn', 124267], [12.3, True, True, True, 'Scikit-learn', 236889], [2.2, True, False, False, None, 74321], [2.7, False, False, True, None, 62593], [1.2, True, False, False, 'Scikit-learn', 36288]]


Next, we will look at the experience and compensation columns.
In the next cell, we collect the maximum, minimum and average of each.

In [41]:
#Initializing lists
experience_coding = []
compensation = []

#Adding the relevant data from the dataset into the lists
for row in survey_responses:
    experience_coding.append(row[0])
    compensation.append(row[-1])

#Here, we will be finding some stats about the
#experience of the data science professionals
max_exp = max(experience_coding)
min_exp = min(experience_coding)
avg_exp = sum(experience_coding)/len(experience_coding)

desc_exp = f"{max_exp} is the maximum experience in the list, {min_exp} is the lowest experience in the list and {avg_exp:.2f} is the average experience in the dataset."

#Finding stats about the compensation data
max_comp = max(compensation)
min_comp = min(compensation)
avg_comp = sum(compensation)/len(compensation)

desc_comp = f"{max_comp} is the maximum compensation received by someone in the dataset, {min_comp} is the lowest compensation in the list and {avg_comp:.2f} is the average compensation in the dataset"

print(desc_exp)
print(desc_comp)


30.0 is the maximum experience in the list, 0.0 is the lowest experience in the list and 5.30 is the average experience in the dataset.
1492951 is the maximum compensation received by someone in the dataset, 0 is the lowest compensation in the list and 53252.82 is the average compensation in the dataset
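
The maximum compensation is far above the average, which suggests a few very large values may be pulling the mean upward. As a hedged aside, the standard library's `statistics` module can report the median alongside the mean; the median is not part of the original analysis and is added here only as an illustration. The sketch uses the five sample rows printed earlier, and `summarize` is a hypothetical helper name.

```python
from statistics import mean, median

# Experience and compensation values from the five sample rows printed earlier
experience_sample = [6.1, 12.3, 2.2, 2.7, 1.2]
compensation_sample = [124267, 236889, 74321, 62593, 36288]


def summarize(values):
    """Return min, max, mean and median of a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


print(summarize(experience_sample))
print(summarize(compensation_sample))
```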


Next, we will divide the respondents into categories based on their years of experience, to see the general relationship between experience and compensation.

In [42]:
# Adding a new column name
column_names.append("Experience Category")

#Adding the experience category value for each row
for row in survey_responses:
    if row[0] < 5:
        row.append("Junior")
    elif row[0] < 10:
        row.append("Intermediate")
    elif row[0] < 15:
        row.append("Senior")
    elif row[0] < 20:
        row.append("Expert")
    else:  # 20 years or more
        row.append("Scholar")

In [43]:
survey_responses[:5]

[[6.1, True, False, True, 'Scikit-learn', 124267, 'Intermediate'],
 [12.3, True, True, True, 'Scikit-learn', 236889, 'Senior'],
 [2.2, True, False, False, None, 74321, 'Junior'],
 [2.7, False, False, True, None, 62593, 'Junior'],
 [1.2, True, False, False, 'Scikit-learn', 36288, 'Junior']]
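
The category boundaries can also be expressed as data rather than as a chain of conditions. The sketch below is an alternative, not the notebook's method: it uses `bisect_right` from the standard library to find which interval a value falls into, and the names `BOUNDS`, `LABELS` and `experience_category` are introduced here for illustration.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of the first four categories, in years
BOUNDS = [5, 10, 15, 20]
LABELS = ["Junior", "Intermediate", "Senior", "Expert", "Scholar"]


def experience_category(years):
    """Map years of experience to the same categories as the if/elif chain."""
    # bisect_right counts how many bounds are <= years, which is the label index
    return LABELS[bisect_right(BOUNDS, years)]


for years in [6.1, 12.3, 2.2, 2.7, 1.2]:
    print(years, experience_category(years))
```

A value that sits exactly on a boundary, such as 5 or 20, goes to the higher category, the same as with the `<` comparisons used in the cell above.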

In [44]:
#Here, we will be collecting the compensation of 
#each experience category into separate lists for
#easier calculation
junior = []
intermediate = []
senior = []
expert = []
scholar = []

for row in survey_responses:
    if row[-1] == 'Junior':
        junior.append(row[-2])
    elif row[-1] == 'Intermediate':
        intermediate.append(row[-2])
    elif row[-1] == 'Senior':
        senior.append(row[-2])
    elif row[-1] == 'Expert':
        expert.append(row[-2])
    else:
        scholar.append(row[-2])


In [45]:
#Creating strings that describe the collected data
jun_desc = f"There are {len(junior)} people who have 0-5 years experience. They get an average salary of {sum(junior)/len(junior):.2f}."
int_desc = f"There are {len(intermediate)} people who have 5-10 years experience. They get an average salary of {sum(intermediate)/len(intermediate):.2f}."
sen_desc = f"There are {len(senior)} people who have 10-15 years experience. They get an average salary of {sum(senior)/len(senior):.2f}."
exp_desc = f"There are {len(expert)} people who have 15-20 years experience. They get an average salary of {sum(expert)/len(expert):.2f}."
sch_desc = f"There are {len(scholar)} people who have 20+ years experience. They get an average salary of {sum(scholar)/len(scholar):.2f}."

In [46]:
print(jun_desc)
print(int_desc)
print(sen_desc)
print(exp_desc)
print(sch_desc)

There are 18753 people who have 0-5 years experience. They get an average salary of 45047.87.
There are 3167 people who have 5-10 years experience. They get an average salary of 59312.82.
There are 1118 people who have 10-15 years experience. They get an average salary of 80226.76.
There are 1069 people who have 15-20 years experience. They get an average salary of 75101.83.
There are 1866 people who have 20+ years experience. They get an average salary of 96747.88.


* The distribution is very uneven, as is evidenced by the huge difference between the numbers in the 0-10 year groups and the rest. A likely explanation is that the recent boom in AI and Machine Learning has brought a large number of new people into the field.

* Compensation generally increases with experience. But there is a slight anomaly in the 15-20 year category, where the average compensation is lower than in the 10-15 group. The 20+ group, however, has a higher average compensation than both the 10-15 and 15-20 year groups, so the trend resumes there.

* There is quite a large jump in compensation from the 5-10 group to the 10-15 group (about 21000), after which the 15-20 group has a noticeably lower average. This could be because these groups are much smaller (1118 and 1069 people), so their averages are less reliable.

* The most extreme value is the 18753 people in the 0-5 year category, which is more than twice the sum of all the other categories (7220). This could also be attributed to the recent boom in AI and Machine Learning.
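
The per-category figures above can also be collected in a single dictionary instead of five separate lists. The sketch below is one possible approach under stated assumptions: it runs on the five sample rows printed earlier, not the full dataset, and it uses explicit column positions (5 for compensation, 6 for the category) because negative indices shift whenever another column is appended.

```python
from collections import defaultdict

# Five rows copied from the notebook output, including the category column
rows = [
    [6.1, True, False, True, 'Scikit-learn', 124267, 'Intermediate'],
    [12.3, True, True, True, 'Scikit-learn', 236889, 'Senior'],
    [2.2, True, False, False, None, 74321, 'Junior'],
    [2.7, False, False, True, None, 62593, 'Junior'],
    [1.2, True, False, False, 'Scikit-learn', 36288, 'Junior'],
]

COMPENSATION = 5  # column positions, assumed from the printed rows
CATEGORY = 6

# Group compensation values by experience category
comp_by_category = defaultdict(list)
for row in rows:
    comp_by_category[row[CATEGORY]].append(row[COMPENSATION])

# Average compensation per category
averages = {
    category: sum(values) / len(values)
    for category, values in comp_by_category.items()
}

for category, average in averages.items():
    print(f"{category}: {len(comp_by_category[category])} people, average {average:.2f}")
```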

Now, we'll look at the number of programming languages each person knows and compare it with compensation.

In [47]:
#Adding a new attribute which gives the number
#of programming languages known to each of them.
#Only the three language columns hold booleans,
#so counting the True values gives that number.
for row in survey_responses:
    num_pro = 0
    for column in row:
        if isinstance(column, bool) and column:
            num_pro += 1
    row.append(num_pro)

column_names.append("number_of_known_languages")
        

In [48]:
survey_responses[:5]

[[6.1, True, False, True, 'Scikit-learn', 124267, 'Intermediate', 2],
 [12.3, True, True, True, 'Scikit-learn', 236889, 'Senior', 3],
 [2.2, True, False, False, None, 74321, 'Junior', 1],
 [2.7, False, False, True, None, 62593, 'Junior', 1],
 [1.2, True, False, False, 'Scikit-learn', 36288, 'Junior', 1]]

In [49]:
#Now let's create three new lists which will store
#the compensation according to the number of languages
#they know.

one = []
two = []
three = []

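for row in survey_responses:
    if row[-1] == 1:
        one.append(row[-3])
    elif row[-1] == 2:
        two.append(row[-3])
    else:
        # Note: this branch catches every other count, so respondents
        # who use none of the three languages (count 0) also land here
        three.append(row[-3])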
        
        
one[:5]

[74321, 62593, 36288, 61302, 18858]

In [50]:
print(f"The average compensation of {len(one)} professionals who know just one language is {sum(one)/len(one):.2f}.")
print(f"The average compensation of {len(two)} professionals who know two languages is {sum(two)/len(two):.2f}.")
print(f"The average compensation of {len(three)} professionals who know three languages is {sum(three)/len(three):.2f}.")

The average compensation of 11761 professionals who know just one language is 52435.36.
The average compensation of 8927 professionals who know two languages is 54821.63.
The average compensation of 5285 professionals who know three languages is 52422.03.


* People who use just one language form the largest group (11761), ahead of the two-language group (8927) and the third group (5285).

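* There is no clear relationship between the number of languages used and compensation in this data: the three group averages differ by less than $2500. One caveat is that the third group is not purely three-language users, because the else branch in the grouping loop also collects people who use none of the three languages. The language totals reported earlier imply that 2506 of its 5285 members fall into that case.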
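
To keep respondents with zero languages separate from those with three, the compensation values can be grouped by the exact count. The sketch below is a minimal illustration, not the notebook's code: it uses three rows from the output above plus one made-up row with no languages, which is labeled as hypothetical in the code, so its results say nothing about the real dataset.

```python
from collections import defaultdict

rows = [
    [6.1, True, False, True, 'Scikit-learn', 124267],
    [12.3, True, True, True, 'Scikit-learn', 236889],
    [2.2, True, False, False, None, 74321],
    [3.0, False, False, False, None, 50000],  # hypothetical row, not from the dataset
]

# Group compensation by the exact number of languages (0, 1, 2 or 3)
comp_by_count = defaultdict(list)
for row in rows:
    n_languages = sum(row[1:4])  # columns 1-3 hold the Python, R and SQL flags
    comp_by_count[n_languages].append(row[5])

for n_languages in sorted(comp_by_count):
    values = comp_by_count[n_languages]
    print(f"{n_languages} languages: {len(values)} people, "
          f"average {sum(values) / len(values):.2f}")
```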

Let's see if there's a difference in compensation between those who know Python and those who don't.

In [51]:
knows_python = []
no_python = []

for row in survey_responses:
    if row[1]:
        knows_python.append(row[5])
    else:
        no_python.append(row[5])


In [56]:
print(f"The average compensation of {len(knows_python)} people who use Python is, {sum(knows_python)/len(knows_python):.2f}.")
print(f"The average compensation of {len(no_python)} people who don't use Python is, {sum(no_python)/len(no_python):.2f}.")

The average compensation of 21860 people who use Python is, 54331.17.
The average compensation of 4113 people who don't use Python is, 47521.53.


* Far more people use Python (21860) than don't (4113), which shows again that it is the most popular language for data science in this dataset.

* There is an evident gap in compensation between people who use Python and those who don't. People who use Python earn roughly $6800 more per year on average, about 14% more, which is a substantial difference.
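
The size of the gap can be worked out directly from the two averages printed above. This short sketch only repeats that arithmetic; the variable names are introduced here for illustration.

```python
# Averages copied from the output above
avg_python = 54331.17
avg_no_python = 47521.53

# Absolute gap, and the gap relative to the non-Python average
gap = avg_python - avg_no_python
relative_gap = gap / avg_no_python

print(f"Absolute gap: {gap:.2f}")
print(f"Relative gap: {relative_gap:.1%}")
```

The comparison describes an association in the survey data, not evidence that learning Python by itself raises compensation.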