In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error 
%matplotlib inline

In [None]:
# Import the data set and take a quick look
df=pd.read_csv('survey_results_public.csv')
df.head()

In [None]:
# Import the second CSV file, which describes each column of the main CSV file
schema=pd.read_csv('survey_results_schema.csv')
schema.head()

In [None]:
df["CompTotal"]

The first thing I want to know is the education level of developers in each country. So, let's look at the "EdLevel" column in more detail:

In [None]:
# Fraction of null values in the education level column (about .3 percent, which is fairly low)
df['EdLevel'].isnull().sum()/df.shape[0]

In [None]:
df['EdLevel'].value_counts()/df.shape[0]
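
Dividing by `df.shape[0]` keeps the rows with a missing answer in the denominator, so the shares above do not sum to 1 whenever some answers are null. The following is a minimal sketch on made-up data (the `toy` Series is an assumption for illustration, not survey data) comparing that approach with `value_counts(normalize=True)`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['EdLevel']; the values are illustrative only
toy = pd.Series(["Bachelor", "Bachelor", "Master", np.nan])

# Same approach as in the cell above: missing rows stay in the denominator
share_of_all_rows = toy.value_counts() / toy.shape[0]

# normalize=True divides by the number of non-null answers only
share_of_answers = toy.value_counts(normalize=True)

# dropna=False reports the missing share as its own row
share_with_missing = toy.value_counts(normalize=True, dropna=False)

print(share_of_all_rows)
print(share_of_answers)
print(share_with_missing)
```

Which denominator is appropriate depends on whether the missing answers should count as part of the population being described.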

In [None]:
pie, ax = plt.subplots(figsize=[10,6])
data=df['EdLevel'].value_counts()
labels = data.keys()
plt.pie(x=data, autopct="%.1f%%", labels=labels, pctdistance=0.5)
plt.title("EdLevel_ALL Participants", fontsize=14);
plt.show()

More than 40% of survey participants hold a Bachelor's degree; this is the aggregate figure for all participants from all countries. In the first part of my first question, *I am interested in the education level for the USA and Canada* specifically.

In the second part: does the "PhD" group have an advantage over the other groups in some attributes?

So, here is a summary of question 1:

### Question 1: What is the education level of participants from Canada and the USA specifically? Is there a specific parameter that helps individuals to be employed full-time?

In [None]:
df_Canada=df[df['Country']=='Canada']

In [None]:
df_Canada['EdLevel'].value_counts()/df_Canada.shape[0]

In [None]:
df['Country'].value_counts()

In [None]:
df_USA=df[df['Country']=='United States of America']

In [None]:
df_USA['EdLevel'].value_counts()/df_USA.shape[0]
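
The two country-level tables can also be put side by side in a single table. This is a rough sketch on made-up rows (the `toy` frame and its shortened education labels are assumptions, not the survey's actual strings) using `pd.crosstab` with `normalize="columns"`, so that each country column sums to 1:

```python
import pandas as pd

# Toy rows standing in for the survey; values are illustrative only
toy = pd.DataFrame({
    "Country": ["Canada", "Canada", "United States of America",
                "United States of America", "United States of America"],
    "EdLevel": ["Bachelor", "Master", "Bachelor", "Bachelor", "PhD"],
})

countries = ["Canada", "United States of America"]
subset = toy[toy["Country"].isin(countries)]

# Rows: education level, columns: country, values: share within each country
shares = pd.crosstab(subset["EdLevel"], subset["Country"], normalize="columns")
print(shares)
```

Note that `crosstab` only counts rows where both columns are present, so these shares exclude missing `EdLevel` answers, unlike the division by `shape[0]` used above.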

In [None]:
# Pie chart for Canada (plt.subplots already creates the figure, so no extra plt.figure call is needed)
fig, ax = plt.subplots(figsize=[10,6])
data_Canada=df_Canada['EdLevel'].value_counts()
labels = data_Canada.keys()
ax.pie(x=data_Canada, autopct="%.1f%%", labels=labels, pctdistance=0.5)
ax.set_title("EdLevel_Canada Participants", fontsize=14)

# Pie chart for the USA
fig, ax = plt.subplots(figsize=[10,6])
data_USA=df_USA['EdLevel'].value_counts()
labels = data_USA.keys()
ax.pie(x=data_USA, autopct="%.1f%%", labels=labels, pctdistance=0.5)
ax.set_title("EdLevel_USA Participants", fontsize=14)

plt.show()

Now, I am interested to see whether a PhD degree is associated with a better job-market situation in the USA compared with Canada and globally.

In [None]:
#For all participants:
df[df['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].value_counts()/df[df['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].shape[0]

In [None]:
#For Canada:
df_Canada[df_Canada['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].value_counts()/df_Canada[df_Canada['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].shape[0]

In [None]:
#For USA:
df_USA[df_USA['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].value_counts()/df_USA[df_USA['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].shape[0]

In [None]:
education_count = df[df['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].value_counts()/df[df['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].shape[0]
plt.figure(figsize=(10,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns_barplot=sns.barplot(x=education_count.index, y=education_count.values, alpha=0.8)
plt.title("PhD holders' employment status (All Participants)")
plt.ylabel('Proportion', fontsize=12)
plt.xlabel('Employment Status', fontsize=12)
for item in sns_barplot.get_xticklabels():
    item.set_rotation(90)
plt.show()

In [None]:
education_count = df_Canada[df_Canada['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].value_counts()/df_Canada[df_Canada['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].shape[0]
plt.figure(figsize=(10,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns_barplot=sns.barplot(x=education_count.index, y=education_count.values, alpha=0.8)
plt.title("PhD holders' employment status (Canada)")
plt.ylabel('Proportion', fontsize=12)
plt.xlabel('Employment Status', fontsize=12)
for item in sns_barplot.get_xticklabels():
    item.set_rotation(90)
plt.show()

In [None]:
education_count = df_USA[df_USA['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].value_counts()/df_USA[df_USA['EdLevel']=='Other doctoral degree (Ph.D., Ed.D., etc.)']['Employment'].shape[0]
plt.figure(figsize=(10,5))
# Pass x and y as keywords; recent seaborn versions no longer accept them positionally
sns_barplot=sns.barplot(x=education_count.index, y=education_count.values, alpha=0.8)
plt.title("PhD holders' employment status (USA)")
plt.ylabel('Proportion', fontsize=12)
plt.xlabel('Employment Status', fontsize=12)
for item in sns_barplot.get_xticklabels():
    item.set_rotation(90)
plt.show()

It looks like a PhD goes along with a high employment rate, since a very large share of PhD holders are employed full-time.
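
To judge whether the PhD group really stands out, its full-time share would need to be set against the other education levels. Here is a small sketch of one way to do that, on made-up data; the `toy` frame and the label `"Employed full-time"` are assumptions and should be checked against the actual strings in `df['Employment']`:

```python
import pandas as pd

# Toy data; the labels below are assumptions, not the exact survey strings
toy = pd.DataFrame({
    "EdLevel": ["PhD", "PhD", "Bachelor", "Bachelor", "Bachelor", "Master"],
    "Employment": ["Employed full-time", "Student", "Employed full-time",
                   "Employed full-time", "Employed part-time", "Employed full-time"],
})

# Boolean flag per respondent, then the mean of that flag within each education level
is_full_time = toy["Employment"].eq("Employed full-time")
full_time_rate = (
    is_full_time.groupby(toy["EdLevel"]).mean().sort_values(ascending=False)
)
print(full_time_rate)
```

The mean of a boolean column is the share of `True` values, so each entry is the full-time employment rate for one education level.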

### Question 2: How are people paid with respect to their years of coding (YearsCode)?

happy with their career (CareerSatisfaction in the data frame)? What about Canada and the USA specifically? Is there a link between career satisfaction and important parameters such as ProblemSolving, BuildingThings, LearningNewTech, JobSecurity and DiversityImportant?
### How do US participants' statistics vary from the global ones (i.e. all data)?

In [None]:
# Let's load the survey results again and see what the percentage of null values is
df=pd.read_csv('survey_results_public.csv')

In [None]:
df["CompTotal"].isna().sum()/df["CompTotal"].shape[0]

In [None]:
# It looks like there are plenty of them. Let's eliminate them.

In [None]:
#drop null values in the columns of interest for this question
df.dropna(subset=["CompTotal", "YearsCode"],inplace=True)

In [None]:
#create a new data frame with only the columns of interest (copy to avoid SettingWithCopyWarning later)
df1=df[["CompTotal","YearsCode"]].copy()

In [None]:
#Check point
df1.isnull().sum()

In [None]:
df1 = df1.sort_values("YearsCode")
df1 = df1[df1["YearsCode"] != "More than 50 years"]

In [None]:
df1 = df1[df1["YearsCode"] != "Less than 1 year"]

In [None]:
df1["YearsCode"] = pd.to_numeric(df1["YearsCode"])
df1["CompTotal"] = pd.to_numeric(df1["CompTotal"])

In [None]:
# The data contain several extreme outliers; they are removed by keeping only compensation below 10000000
df1 = df1[df1["CompTotal"] < 10000000]
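
A fixed ceiling such as 10000000 is one option. A possible alternative, sketched here on made-up numbers, is to derive the ceiling from a quantile of the data itself. The `comp` Series and the 0.80 quantile are assumptions chosen for this tiny example; on the real survey a higher quantile such as 0.99 would be a more natural starting point.

```python
import pandas as pd

# Toy compensation values; one extreme entry plays the role of an outlier
comp = pd.Series([40_000, 55_000, 60_000, 75_000, 90_000, 5_000_000_000])

# Hypothetical alternative to a fixed ceiling: keep values at or below a chosen quantile
upper = comp.quantile(0.80)
trimmed = comp[comp <= upper]

print(upper)
print(trimmed)
```

Because `CompTotal` mixes currencies, either kind of cutoff remains a rough filter rather than a true outlier test.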

In [None]:
#Number of responses for each YearsCode value
df1_count = df1.groupby("YearsCode", as_index=False).count()

In [None]:
df1_count.head()

In [None]:
#bar chart of the response counts (kind="bar" is needed; catplot draws a strip plot by default)
chart=sns.catplot(x="YearsCode", y="CompTotal", kind="bar", palette="ch:.25", data=df1_count)
chart.set_xticklabels(rotation=45);

In [None]:
#Mean information of all data set
df1_mean = df1.groupby("YearsCode", as_index=False).mean()

In [None]:
#bar chart of the mean (kind="bar" is needed; catplot draws a strip plot by default)

chart=sns.catplot(x="YearsCode", y="CompTotal", kind="bar", palette="ch:.30", data=df1_mean)
chart.set_xticklabels(rotation=45);


It looks like mean compensation drops after 15-17 years of coding. This may be related to technical expertise: someone who has been coding for about 50 years started when there were far fewer technologies. Also, our data has more data points from people with 5-10 years of experience. This could mean that older people don't like to participate in the survey or spend time on Stack Overflow.
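
The cells above drop the two text answers before converting `YearsCode` to numbers. A possible variation, sketched on made-up data, keeps them by mapping each to a numeric stand-in and reports the group size and median next to the mean. The `toy` frame and the stand-in values 0.5 and 51 are assumptions, not values from the survey.

```python
import pandas as pd

# Toy data mimicking the mix of numeric strings and text answers in YearsCode
toy = pd.DataFrame({
    "YearsCode": ["Less than 1 year", "5", "5", "10", "More than 50 years"],
    "CompTotal": [30_000, 50_000, 70_000, 90_000, 80_000],
})

# Assumed numeric stand-ins for the two text answers (a modelling choice)
text_to_number = {"Less than 1 year": 0.5, "More than 50 years": 51}

# Numeric strings convert directly; text answers become NaN and are then filled from the mapping
years = pd.to_numeric(toy["YearsCode"], errors="coerce")
toy["YearsCodeNum"] = years.fillna(toy["YearsCode"].map(text_to_number))

# Count, mean and median per group: small groups and skewed groups become visible
summary = toy.groupby("YearsCodeNum")["CompTotal"].agg(["count", "mean", "median"])
print(summary)
```

Showing the count alongside the mean helps when reading the right-hand end of the chart, where each bar may rest on only a few responses.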

### Question 3: How is "Salary" related to the age at which people started coding?

During the lesson, the instructor tried to use Machine Learning to predict salary. For the last question of this project, I decided to find something similar to that, but instead of considering ALL categorical and ALL quantitative parameters that he considered in his analysis, I am going to focus on the following variables:
- Race
- Gender
- Currency
- HoursPerWeek
- YearsProgram
- YearsCodedJob

My justification for doing this is to avoid the *overfitting* issue that can arise when ALL the data is considered for the prediction.

P.S.: Initially I had planned to use **Expected Salary** instead of Salary as the dependent variable, but after initial investigation I found that only a few rows in this data set have an Expected Salary value. So I switched gears again and focused on Salary, but with different variables.



In [None]:
# Let's load the survey results again and see what the percentage of null values is
df=pd.read_csv('survey_results_public.csv')

In [None]:
#drop null values in the columns of interest for this question
df.dropna(subset=["CompTotal", "Age1stCode"],inplace=True)

In [None]:
df_compare = df[["CompTotal","Age1stCode"]]

In [None]:
df_compare = df_compare[df_compare["CompTotal"] < 10000000]

In [None]:
df_compare.groupby("Age1stCode").mean()

It can be seen that the group that started coding at 18-24 years makes the most. The second group is those who started when older than 64 years. This may be because they are executives who have grown interested in coding.
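
Because `Age1stCode` holds text labels, `groupby` sorts the groups alphabetically rather than by age, and a group mean says nothing about how many people are in the group. This sketch on made-up data uses an ordered categorical to fix the order and adds the count. The `toy` frame and the label strings in `age_order` are assumptions; the real labels can be checked with `df["Age1stCode"].unique()`.

```python
import pandas as pd

# Assumed answer labels in their natural order (verify against the real column)
age_order = ["Younger than 5 years", "5 - 10 years", "11 - 17 years",
             "18 - 24 years", "Older than 64 years"]

toy = pd.DataFrame({
    "Age1stCode": ["18 - 24 years", "5 - 10 years", "Older than 64 years",
                   "11 - 17 years", "11 - 17 years"],
    "CompTotal": [90_000, 60_000, 85_000, 70_000, 80_000],
})

# An ordered categorical makes groupby (and plots) follow age order instead of alphabetical order
toy["Age1stCode"] = pd.Categorical(toy["Age1stCode"], categories=age_order, ordered=True)

# observed=True keeps only the categories that actually occur in the data
summary = toy.groupby("Age1stCode", observed=True)["CompTotal"].agg(["count", "mean"])
print(summary)
```

A high mean for a group with very few respondents, as may be the case for the oldest starting age, should be read with caution.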

In [None]:
#bar chart of mean

chart=sns.catplot(x="Age1stCode", y="CompTotal", kind="bar", palette="ch:.30", data=df_compare.groupby("Age1stCode", as_index=False).mean())
chart.set_xticklabels(rotation=60);