In [129]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import bokeh
import plotly
import plotly.express as px
import plotly.graph_objects as go
import plotly.offline as py

In [130]:
#!pip install dash 
#import dash

# Overview
Throughout this assignment, you will be performing certain well-defined tasks that’ll not only strengthen your concepts of Plotly and Dash, but will also help you learn a number of new concepts that are useful in analyzing, summarizing and visualizing data in the real world. 

Here is a template notebook with all the tasks mentioned in detail. **Please complete the tasks within the designated section only.**


## Task 1: Data Loading and Data Aggregation
* Load the 3 data files into the variables data_18, data_19, data_20. 

* Data aggregation is the process of gathering data and presenting it in a summarized format. The data may be gathered from multiple data sources with the intent of combining these data sources into a summary for data analysis.         
Similar to how this dataset involves 3 data files, you’ll often be working on combining information from 2 or more files and analysing it. More often than not, GroupBy is a very useful tool for this purpose. 

  Go through this article to learn more about some helpful aggregation tools in Python: https://www.bmc.com/blogs/pandas-group-merge-concatenate-join/ 

  **You don't need to aggregate/ merge the datasets in this assignment, it is only for reading purposes.**
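To make the idea of aggregation concrete, here is a minimal sketch that stacks two small tables and summarizes them with GroupBy. The frames and their values are made up for illustration and are not taken from the survey files.

```python
import pandas as pd

# Toy stand-ins for two yearly survey files (values are made up for illustration).
survey_a = pd.DataFrame({"City": ["Berlin", "Munich", "Berlin"],
                         "Salary": [60000, 70000, 80000]})
survey_b = pd.DataFrame({"City": ["Berlin", "Munich"],
                         "Salary": [65000, 75000]})

# Stack the two files and remember which year each row came from.
combined = pd.concat(
    [survey_a.assign(Year=2018), survey_b.assign(Year=2019)],
    ignore_index=True,
)

# Summarize: median salary per year and city.
summary = combined.groupby(["Year", "City"])["Salary"].median()
print(summary)
```

The same pattern should carry over to real files once their column names have been made consistent.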

In [131]:
data_2018 = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/IT_Salary_Survey_EU_18-20/Survey_2018.csv')
data_2019 = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/IT_Salary_Survey_EU_18-20/Survey_2019.csv')
data_2020 = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/IT_Salary_Survey_EU_18-20/Survey_2020.csv')


## Task 2: Data Analysis
* Display the first 5 rows of the 2018 survey data
* Display a concise summary of the 2020 data and list out 3 observations/inferences that you observe from the result. For this you will need to use the info() method.
* Display the descriptive statistics of the 2018 survey data
* Display the number of missing values in each column of the 2018 survey data
* How many people responded to the survey in each of the 3 years? Has the number increased or decreased over the years?
* Display all the unique values and their frequency in the column - “Number of vacation days” of 2020 data. Write down your observations (at least one) for this result. 
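
One possible way to answer the question about the number of respondents is to collect the row count and the number of missing cells for each year in a single table. The sketch below uses small made-up frames; the dictionary name `surveys` is an assumption and would hold the three loaded DataFrames in practice.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the three yearly surveys (made-up values).
surveys = {
    2018: pd.DataFrame({"Age": [30, np.nan, 41]}),
    2019: pd.DataFrame({"Age": [28, 35, np.nan, 50]}),
    2020: pd.DataFrame({"Age": [25, 33, 38, 44, np.nan]}),
}

# One row per year: number of respondents and total number of missing cells.
overview = pd.DataFrame({
    "respondents": {year: len(df) for year, df in surveys.items()},
    "missing_cells": {year: int(df.isnull().sum().sum()) for year, df in surveys.items()},
})
print(overview)
```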


In [132]:
data_2018.head()

Unnamed: 0,Timestamp,Age,Gender,City,Position,Years of experience,Your level,Current Salary,Salary one year ago,Salary two years ago,Are you getting any Stock Options?,Main language at work,Company size,Company type
0,14/12/2018 12:41:33,43.0,M,München,QA Ingenieur,11.0,Senior,77000.0,76200.0,68000.0,No,Deutsch,100-1000,Product
1,14/12/2018 12:42:09,33.0,F,München,Senior PHP Magento developer,8.0,Senior,65000.0,55000.0,55000.0,No,Deutsch,50-100,Product
2,14/12/2018 12:47:36,32.0,M,München,Software Engineer,10.0,Senior,88000.0,73000.0,54000.0,No,Deutsch,1000+,Product
3,14/12/2018 12:50:15,25.0,M,München,Senior Frontend Developer,6.0,Senior,78000.0,55000.0,45000.0,Yes,English,1000+,Product
4,14/12/2018 12:50:31,39.0,M,München,UX Designer,10.0,Senior,69000.0,60000.0,52000.0,No,English,100-1000,Ecom retailer


In [133]:
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 765 entries, 0 to 764
Data columns (total 14 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Timestamp                           765 non-null    object 
 1   Age                                 672 non-null    float64
 2   Gender                              751 non-null    object 
 3   City                                736 non-null    object 
 4   Position                            737 non-null    object 
 5   Years of experience                 732 non-null    float64
 6   Your level                          743 non-null    object 
 7   Current Salary                      750 non-null    float64
 8   Salary one year ago                 596 non-null    float64
 9   Salary two years ago                463 non-null    float64
 10  Are you getting any Stock Options?  742 non-null    object 
 11  Main language at work               750 non-n

In [134]:
data_2018.describe()

Unnamed: 0,Age,Years of experience,Current Salary,Salary one year ago,Salary two years ago
count,672.0,732.0,750.0,596.0,463.0
mean,32.183036,8.548497,68381.765333,62187.278523,58013.475162
std,5.107268,4.729557,21196.306557,20163.008663,20413.048908
min,21.0,0.0,10300.0,10001.0,10001.0
25%,29.0,5.0,57000.0,52000.0,48000.0
50%,32.0,8.0,65000.0,60000.0,56000.0
75%,35.0,11.0,75000.0,70000.0,67000.0
max,60.0,38.0,200000.0,200000.0,150000.0


In [135]:
data_2018.isnull().sum()

Timestamp                               0
Age                                    93
Gender                                 14
City                                   29
Position                               28
Years of experience                    33
Your level                             22
Current Salary                         15
Salary one year ago                   169
Salary two years ago                  302
Are you getting any Stock Options?     23
Main language at work                  15
Company size                           15
Company type                           35
dtype: int64

In [136]:
data_2019.isnull().sum()

Zeitstempel                                                                                               0
Age                                                                                                     109
Gender                                                                                                    0
City                                                                                                      0
Seniority level                                                                                          15
Position (without seniority)                                                                              1
Years of experience                                                                                       0
Your main technology / programming language                                                              14
Yearly brutto salary (without bonus and stocks)                                                           1
Yearly bonus                

In [137]:
data_2020.isnull().sum()

Timestamp                                                                                                                    0
Age                                                                                                                         27
Gender                                                                                                                      10
City                                                                                                                         0
Position                                                                                                                     6
Total years of experience                                                                                                   16
Years of experience in Germany                                                                                              32
Seniority level                                                                                                

**The number of missing (null) values has grown over the years.**

In [138]:
data_2020['Number of vacation days'].value_counts()

30                                              488
28                                              233
27                                              102
25                                               91
26                                               71
24                                               67
29                                               24
20                                               13
21                                               10
32                                                8
22                                                8
31                                                8
35                                                5
36                                                5
23                                                4
40                                                4
0                                                 4
14                                                3
33                                                3
unlimited   

**The values in this column range from 0 days to "unlimited", so the column mixes numbers with free text. There are also 68 missing values.**
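
Because the column mixes numbers with text such as "unlimited", one option is to convert it to numbers and let non-numeric entries become NaN. This is a sketch on a made-up sample, not the cleaning step used in this notebook.

```python
import pandas as pd

# Made-up sample mimicking the mix of numbers and free text seen above.
vacation = pd.Series(["30", "28", "unlimited", None, "25", "0"])

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
vacation_num = pd.to_numeric(vacation, errors="coerce")

print(vacation_num.describe())
print("non-numeric or missing:", int(vacation_num.isna().sum()))
```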

## Task 3: Data Cleaning
* Rename the column ‘Position ‘ in the 2020 data to ‘Position’. (without the blank space)
* Check for missing values in 2020 data for all the columns. If there are no missing values, proceed to the next step. If there are missing values in the dataset,
  * For categorical variables, fill the missing values with the mode of the data. Remember if the data type of any variable is ‘object’, it is categorical variable. 
  * For numerical variables, fill the missing values with the mean of the data.

Here's a good blog that displays multiple methods of filling (imputing) missing values: https://jamesrledoux.com/code/imputation 
* Drop the timestamp column for all the three years data since the date and time at which a person filled the survey is irrelevant to us. The year matters and we already know that from the dataset’s name.
* Perform any other data cleaning steps you believe are necessary (removing outliers, handling missing values in a way that improves the visualizations, making the categories uniform, i.e. python and Python should mean the same thing, etc.). Note that the same steps will have to be performed for all 3 data files.
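
The imputation rule described above (mode for categorical columns, mean for numerical ones) can be sketched with plain pandas as follows. The tiny frame is made up, and the split by dtype is one assumption about how to tell the two kinds of columns apart.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numerical and one categorical column.
df = pd.DataFrame({
    "Age": [30.0, np.nan, 40.0],
    "Gender": ["Male", None, "Male"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Numerical columns: fill each column with its own mean.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# Categorical columns: fill each column with its most frequent value (first mode).
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print(df)
```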

In [139]:
data_2020.rename(columns = {'Position ':'Position'}, inplace = True)

In [140]:
# checking missing values
data_2020.isnull().sum()

Timestamp                                                                                                                    0
Age                                                                                                                         27
Gender                                                                                                                      10
City                                                                                                                         0
Position                                                                                                                     6
Total years of experience                                                                                                   16
Years of experience in Germany                                                                                              32
Seniority level                                                                                                

In [141]:
data_2020.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1253 entries, 0 to 1252
Data columns (total 23 columns):
 #   Column                                                                                                                   Non-Null Count  Dtype  
---  ------                                                                                                                   --------------  -----  
 0   Timestamp                                                                                                                1253 non-null   object 
 1   Age                                                                                                                      1226 non-null   float64
 2   Gender                                                                                                                   1243 non-null   object 
 3   City                                                                                                                     1253 non-null   o

In [142]:
from sklearn.impute import SimpleImputer
imputer_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

In [143]:
# numeric features for year 2020
features_num_2020 = ['Age',
                'Yearly brutto salary (without bonus and stocks) in EUR',
                'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',]

# categorical features
features_cat_2020 = ['Gender', 'City', 'Position',
       'Total years of experience', 'Years of experience in Germany',
       'Seniority level', 'Your main technology / programming language',
       'Other technologies/programming languages you use often',
       'Number of vacation days',
       'Employment status', 'Сontract duration',
       'Main language at work', 'Company size', 'Company type',
       'Have you lost your job due to the coronavirus outbreak?',
       'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
       'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR']

# features we ignore in the following
features_not_used_2020 = ['Timestamp',
                     'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
                     'Annual bonus+stocks one year ago. Only answer if staying in same country']


# check if we have captured all features
len(features_cat_2020 + features_num_2020 + features_not_used_2020) - len(data_2020.columns)

0
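
A length comparison like the one above can return 0 even when the lists are not a clean partition of the columns. In the lists above, the Kurzarbeit column appears in both the numeric and the categorical list, which could hide a column that is not listed at all. A set-based check is a possible complement. The names below are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical column names and feature lists, for illustration only.
columns = ["Timestamp", "Age", "Gender", "Salary", "Bonus"]
features_num = ["Age", "Salary"]
features_cat = ["Gender", "Salary"]   # "Salary" is listed twice by mistake
features_not_used = ["Timestamp"]

listed = features_num + features_cat + features_not_used

# The lengths match, so a length-only check would pass ...
length_diff = len(listed) - len(columns)

# ... but these checks reveal a duplicate entry and a forgotten column.
duplicates = sorted(name for name, n in Counter(listed).items() if n > 1)
missing = sorted(set(columns) - set(listed))

print(length_diff, duplicates, missing)
```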

In [144]:
for col in features_num_2020:
  data_2020[col] = imputer_mean.fit_transform(data_2020[col].values.reshape(-1, 1))

In [145]:
imputer_mode = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

In [146]:
for col in features_cat_2020:
  data_2020[col] = imputer_mode.fit_transform(data_2020[col].values.reshape(-1, 1))

In [147]:
# numeric columns: fill each column with its own mean
data_2020.fillna(data_2020.select_dtypes(include='number').mean(),inplace=True)

# categorical columns
data_2020.fillna(data_2020.select_dtypes(include='object').mode().iloc[0],inplace=True)

data_2020.isnull().sum()

Timestamp                                                                                                                  0
Age                                                                                                                        0
Gender                                                                                                                     0
City                                                                                                                       0
Position                                                                                                                   0
Total years of experience                                                                                                  0
Years of experience in Germany                                                                                             0
Seniority level                                                                                                            0


In [148]:
#renaming 
data_2019.rename(columns={"Zeitstempel": "Timestamp"},inplace=True)
#drop timestamp
data_2018.drop('Timestamp',axis=1, inplace=True)
data_2019.drop('Timestamp',axis=1, inplace=True)
data_2020.drop('Timestamp',axis=1, inplace=True)



In [149]:
data_2018.isnull().sum()

Age                                    93
Gender                                 14
City                                   29
Position                               28
Years of experience                    33
Your level                             22
Current Salary                         15
Salary one year ago                   169
Salary two years ago                  302
Are you getting any Stock Options?     23
Main language at work                  15
Company size                           15
Company type                           35
dtype: int64

In [150]:
data_2018.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 765 entries, 0 to 764
Data columns (total 13 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Age                                 672 non-null    float64
 1   Gender                              751 non-null    object 
 2   City                                736 non-null    object 
 3   Position                            737 non-null    object 
 4   Years of experience                 732 non-null    float64
 5   Your level                          743 non-null    object 
 6   Current Salary                      750 non-null    float64
 7   Salary one year ago                 596 non-null    float64
 8   Salary two years ago                463 non-null    float64
 9   Are you getting any Stock Options?  742 non-null    object 
 10  Main language at work               750 non-null    object 
 11  Company size                        750 non-n

In [151]:
data_2019.drop(['0'],axis=1,inplace=True)
data_2018.rename(columns = {'Your level':'Seniority level'}, inplace = True)
data_2019.rename(columns = {'Position (without seniority)':'Position'}, inplace = True)
data_2019.rename(columns = {'Company name ':'Company name'}, inplace = True)

In [152]:
features_num_2018 = ['Age', 'Years of experience',
                     'Current Salary', 'Salary one year ago','Salary two years ago']
features_cat_2018 = ['Gender', 'City','Position','Seniority level', 'Are you getting any Stock Options?','Main language at work','Company size','Company type']

features_num_2019 = ['Age', 'Years of experience',
                     'Number of vacation days','Number of home office days per month']

features_cat_2019 = ['Gender', 'City','Position','Seniority level', 'Your main technology / programming language',
                     'Main language at work','Company size','Company type','Company name',
                     'Сontract duration','Company business sector']

features_not_used_2019 = ['Yearly brutto salary (without bonus and stocks)','Yearly bonus','Yearly stocks',
                          'Yearly brutto salary (without bonus and stocks) one year ago. Only answer if staying in same country',
                          'Yearly bonus one year ago. Only answer if staying in same country',
                          'Yearly stocks one year ago. Only answer if staying in same country',

                          ]

In [153]:
# numerical
for col in features_num_2018:
  data_2018[col] = imputer_mean.fit_transform(data_2018[col].values.reshape(-1, 1))

# categorical
for col in features_cat_2018:
  data_2018[col] = imputer_mode.fit_transform(data_2018[col].values.reshape(-1, 1))

In [154]:
# numerical
for col in features_num_2019:
  data_2019[col] = imputer_mean.fit_transform(data_2019[col].values.reshape(-1, 1))

# categorical
for col in features_cat_2019:
  data_2019[col] = imputer_mode.fit_transform(data_2019[col].values.reshape(-1, 1))

# not used columns
for col in features_not_used_2019:
  data_2019[col] = imputer_mean.fit_transform(data_2019[col].values.reshape(-1, 1))


In [155]:
# checking missing values after
data_2018.isnull().sum()

Age                                   0
Gender                                0
City                                  0
Position                              0
Years of experience                   0
Seniority level                       0
Current Salary                        0
Salary one year ago                   0
Salary two years ago                  0
Are you getting any Stock Options?    0
Main language at work                 0
Company size                          0
Company type                          0
dtype: int64

In [156]:
# checking missing values after
data_2019.isnull().sum()

Age                                                                                                     0
Gender                                                                                                  0
City                                                                                                    0
Seniority level                                                                                         0
Position                                                                                                0
Years of experience                                                                                     0
Your main technology / programming language                                                             0
Yearly brutto salary (without bonus and stocks)                                                         0
Yearly bonus                                                                                            0
Yearly stocks                                 

In [157]:
data_2020.isnull().sum()

Age                                                                                                                        0
Gender                                                                                                                     0
City                                                                                                                       0
Position                                                                                                                   0
Total years of experience                                                                                                  0
Years of experience in Germany                                                                                             0
Seniority level                                                                                                            0
Your main technology / programming language                                                                                0


In [158]:
data_2018['Main language at work'].value_counts()

English                                  596
Deutsch                                  134
Russian                                   29
French                                     2
Polish                                     2
Team - Russian; Cross-team - English;      1
Deutsch/Englisch                           1
Name: Main language at work, dtype: int64

In [159]:
data_2020['Your main technology / programming language'].unique()

array(['TypeScript', 'Ruby', 'Javascript / Typescript', 'Javascript',
       'C# .NET', 'AWS, GCP, Python,K8s', 'Typescript', 'PHP', 'Java',
       'Aws Hadoop Postgre Typescript', 'C++', 'Kotlin', 'kotlin',
       'NodeJS', 'iOS', 'Kubernetes', 'Charles', 'SQL', 'Go', 'java',
       'Python', 'Figma', 'JavaScript', 'Go/Python', 'React', 'С#', 'Php',
       'ruby on rails', 'JavaScript/ES6', '.NET', 'Hardware', 'C#',
       'Google Cloud Platform', 'Js', 'android', 'JavaScript ', 'Scala',
       'python', 'C#, .net core', 'VHDL', 'Power BI', 'PHP ', 'none',
       'Android', 'Swift', 'ML', 'php', 'Scala, React.js', 'Ml/Python',
       'JavaScript/TypeScript', 'Ruby on Rails', 'Azure, SAP', 'Frontend',
       'Java, JavaScript', 'yaml', 'Python ', 'JS', 'Java ', '-', 'Agile',
       'C', 'TypeScript, JavaScript', 'Pegasystems platform ',
       'C++, Java, Embedded C', 'Cloud', 'DC Management', '--', 'SWIFT',
       'Java, angular, Aws', 'Swift, objective-c', 'Golang', 'go',
       'Dev

In [160]:
# replacing some entries..
data_2020['Your main technology / programming language'] = data_2020['Your main technology / programming language'].replace(['python ',' python','pythin',' Python','Python ','Pyrhon'], 'Python')
data_2020['Your main technology / programming language'] = data_2020['Your main technology / programming language'].replace(['c++','c++ ',' c++','C++ ',' C++'], 'C++')
# Upper case for all prog languages
data_2020['Your main technology / programming language'] = data_2020['Your main technology / programming language'].str.upper()


In [161]:
data_2020['Your main technology / programming language'].unique()

array(['TYPESCRIPT', 'RUBY', 'JAVASCRIPT / TYPESCRIPT', 'JAVASCRIPT',
       'C# .NET', 'AWS, GCP, PYTHON,K8S', 'PHP', 'JAVA',
       'AWS HADOOP POSTGRE TYPESCRIPT', 'C++', 'KOTLIN', 'NODEJS', 'IOS',
       'KUBERNETES', 'CHARLES', 'SQL', 'GO', 'PYTHON', 'FIGMA',
       'GO/PYTHON', 'REACT', 'С#', 'RUBY ON RAILS', 'JAVASCRIPT/ES6',
       '.NET', 'HARDWARE', 'C#', 'GOOGLE CLOUD PLATFORM', 'JS', 'ANDROID',
       'JAVASCRIPT ', 'SCALA', 'C#, .NET CORE', 'VHDL', 'POWER BI',
       'PHP ', 'NONE', 'SWIFT', 'ML', 'SCALA, REACT.JS', 'ML/PYTHON',
       'JAVASCRIPT/TYPESCRIPT', 'AZURE, SAP', 'FRONTEND',
       'JAVA, JAVASCRIPT', 'YAML', 'JAVA ', '-', 'AGILE', 'C',
       'TYPESCRIPT, JAVASCRIPT', 'PEGASYSTEMS PLATFORM ',
       'C++, JAVA, EMBEDDED C', 'CLOUD', 'DC MANAGEMENT', '--',
       'JAVA, ANGULAR, AWS', 'SWIFT, OBJECTIVE-C', 'GOLANG', 'DEVOPS',
       'NODE.JS', 'R', 'BASH', 'NETWORK', 'NOTHING', 'QLIK BI TOOL, SQL',
       'BLOCKCHAIN', 'ANGULAR', 'AUTONOMOUS DRIVING',
       'JS
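
The output above still contains near-duplicates such as 'JAVASCRIPT ' (trailing space) and 'JS'. A possible follow-up, sketched here on a made-up sample, is to strip whitespace before upper-casing and then merge known synonyms. The `aliases` table is hypothetical and would need to be built after inspecting the real values.

```python
import pandas as pd

# Made-up sample with the kinds of variants seen in the survey column.
langs = pd.Series(["Python ", "python", "JavaScript ", "Js", "JS", "C++", " c++"])

# Hypothetical alias table mapping synonyms to one canonical label.
aliases = {"JS": "JAVASCRIPT"}

cleaned = (
    langs.str.strip()       # drop leading and trailing spaces
         .str.upper()       # make the case uniform
         .replace(aliases)  # merge known synonyms
)
print(cleaned.value_counts())
```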

## Task 4: Data Visualization using Plotly
**Note:** All the tasks below need to be completed using only Plotly and no other Data Visualization library.

* Create a pie chart to analyze the Company types in the year 2019. Are Consulting / Agency companies more popular than Startups? 
* Create a line plot of the Total years of experience vs the current salary(taking the median salary for each of the different experience years) of the year 2018.
* Now, create the above plot again and add 2 more line plots to the same graph, that display the Total years of experience vs the median Yearly brutto salary (without bonus and stocks) of the year 2019 and 2020.
* Create a bar chart to analyse the popularity of the main technology/ programming languages amongst the respondents in the year 2020. Which technology is the most popular? Which technology is the least popular (with less than 4 responses)?
* Create a pie plot indicating the gender ratio of the respondents in the year 2020.


In [162]:
cmp_tp  = data_2019['Company type'].value_counts()
cmp_tp_un = data_2019['Company type'].dropna().unique()

In [163]:
cmp_tp_un

array(['Startup', 'Product', 'Consulting / Agency',
       'Bodyshop / Outsource', 'University', 'Bank', 'Outsource'],
      dtype=object)

In [164]:
# value_counts() already returns the counts in descending order
cmp_fig = cmp_tp.tolist()

In [165]:
cmp_fig

[650, 181, 117, 30, 6, 6, 1]

In [166]:
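# Caution: counts are frequency-ordered, labels are in order of appearance; check that they line up
df_cmpfig = pd.DataFrame(cmp_fig,columns=["Number"], index= cmp_tp_un)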

In [167]:
df_cmpfig

Unnamed: 0,Number
Startup,650
Product,181
Consulting / Agency,117
Bodyshop / Outsource,30
University,6
Bank,6
Outsource,1


In [168]:
fig = px.pie(df_cmpfig, values='Number', names=df_cmpfig.index, title='Company types in the year 2019')
fig.show()


**According to the table above, startups are the largest group and clearly outnumber Consulting / Agency companies.**
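
One thing worth double-checking in the table above: `value_counts()` sorts by frequency, while `unique()` keeps the order of first appearance, so pairing the two by position can attach counts to the wrong labels. Whether that happens here depends on the actual data. The sketch below shows the effect on a made-up column and a safer alternative that keeps the labels returned by `value_counts()`.

```python
import pandas as pd

# Made-up column: "Product" is the most frequent value, "Startup" appears first.
company_type = pd.Series(["Startup", "Product", "Product", "Product", "Startup", "Bank"])

counts = company_type.value_counts()   # sorted by frequency: Product first
labels = company_type.unique()         # order of appearance: Startup first

# Pairing by position attaches the largest count to the wrong label.
mispaired = pd.DataFrame({"Number": counts.tolist()}, index=labels)

# Keeping the index from value_counts() keeps counts and names aligned.
paired = counts.rename("Number").to_frame()

print(mispaired)
print(paired)
```

A frame built like `paired` can be passed to `px.pie` in the same way as above, with `values='Number'` and `names=paired.index`.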

In [169]:
data_2018['Current Salary'].median()

65000.0

In [170]:
data_2018['Years of experience'].groupby(by=data_2018['Current Salary']).median()

Current Salary
10300.0      1.000000
13000.0      2.000000
15000.0     10.000000
17532.0      2.000000
19200.0      2.000000
              ...    
150000.0     8.548497
165000.0    15.000000
176000.0    13.000000
180000.0    18.000000
200000.0    20.000000
Name: Years of experience, Length: 140, dtype: float64

**In the line plot for years of experience vs salary, you need to take the median salary for each experience year, for example the median of all the salaries of people with 5 years of experience, and so on. Otherwise, the plots will be too cluttered and won't make sense.**
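
The note above can be sketched on a small made-up table: group by the experience column and take the median of the salary column, which yields one value per experience year.

```python
import pandas as pd

# Made-up rows: three people with 5 years of experience, two with 10.
df = pd.DataFrame({
    "Years of experience": [5, 5, 5, 10, 10],
    "Current Salary": [50000, 60000, 90000, 70000, 80000],
})

# One median salary per distinct value of "Years of experience".
median_by_xp = df.groupby("Years of experience")["Current Salary"].median()
print(median_by_xp)
```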

In [171]:
grp = data_2018.groupby('Years of experience').groups
grp

{0.0: [236, 708, 714], 0.5: [120, 158], 1.0: [35, 47, 54, 57, 77, 263, 327, 358, 393, 397, 422, 451, 634, 649, 684, 742, 749, 760, 761, 762], 1.5: [209, 581], 2.0: [34, 100, 186, 196, 353, 430, 455, 469, 470, 487, 492, 493, 535, 569, 589, 693, 724], 2.5: [157, 190, 266, 294, 454], 3.0: [8, 15, 44, 61, 70, 73, 87, 96, 104, 108, 110, 123, 131, 146, 154, 193, 194, 232, 234, 251, 254, 338, 363, 370, 398, 407, 434, 436, 445, 538, 601, 611, 672, 677, 679, 685, 706, 725, 748], 4.0: [17, 24, 33, 58, 63, 65, 74, 90, 91, 113, 128, 129, 138, 176, 181, 191, 201, 204, 246, 247, 257, 262, 310, 314, 332, 351, 362, 365, 367, 395, 396, 444, 466, 471, 507, 529, 532, 539, 556, 575, 595, 604, 657, 666, 669, 670, 750], 4.5: [588], 5.0: [6, 22, 29, 38, 64, 67, 69, 78, 82, 86, 107, 116, 121, 126, 141, 144, 148, 172, 173, 174, 182, 195, 202, 203, 211, 216, 220, 230, 238, 244, 250, 256, 259, 271, 277, 283, 291, 293, 299, 305, 308, 312, 315, 318, 328, 329, 337, 341, 357, 377, 405, 415, 428, 433, 443, 463, 482, 

In [172]:
# year 2018
groupe = data_2018.groupby('Years of experience')
ddf = groupe['Current Salary'].median()

In [173]:
d1 = pd.DataFrame(ddf)
d1.head()

Years of experience,Current Salary
0.0,50000.0
0.5,46000.0
1.0,46000.0
1.5,58000.0
2.0,50000.0


In [174]:
fig = px.line(d1, x=d1.index, y='Current Salary', title='Years of XP vs Median Current Salary of Each Year of XP')
fig.show()

In [175]:
# year 2019 
groupe_2019 = data_2019.groupby('Years of experience')
dd_19 = groupe_2019['Yearly brutto salary (without bonus and stocks)'].median()

In [176]:
d2 = pd.DataFrame(dd_19)
d2.head()

Years of experience,Yearly brutto salary (without bonus and stocks)
0.0,55000.0
1.0,46800.0
2.0,52500.0
3.0,55000.0
4.0,62000.0


In [177]:
d2.head(50)

Years of experience,Yearly brutto salary (without bonus and stocks)
0.0,55000.0
1.0,46800.0
2.0,52500.0
3.0,55000.0
4.0,62000.0
5.0,65000.0
6.0,68000.0
7.0,67000.0
8.0,74500.0
9.0,74500.0


In [178]:
data_2020['Total years of experience'].unique()

array(['5', '7', '12', '4', '17', '6', '8', '15', '2', '25', '10', '14',
       '11', '18', '13', '30', '3', '40', '26', '23', '9', '19', '20',
       '5.5', '22', '16', '0.8', '1', '1.5', '6.5', '21', '7.5', '2.5',
       '28', '29', '1,5', '24', '0', '4.5', '27',
       '1 (as QA Engineer) / 11 in total', '2,5', '15, thereof 8 as CTO',
       '31', '6 (not as a data scientist, but as a lab scientist)', '383',
       '3.5', 'less than year'], dtype=object)

In [179]:
# Map free-text or mistyped entries to a single numeric string.
experience_fixes = {
    '1 (as QA Engineer) / 11 in total': '11',
    '15, thereof 8 as CTO': '15',
    '6 (not as a data scientist, but as a lab scientist)': '6',
    'less than year': '0.5',
    '383': '38.3',  # treated as a typo for 38.3
    '1,5': '1.5',
    '2,5': '2.5',
}
data_2020['Total years of experience'] = data_2020['Total years of experience'].replace(experience_fixes)
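
As an alternative to listing every odd entry by hand, a generic parsing rule could be tried. The sketch below is an assumption, not the method used in this notebook: it reads a comma between two digits as a decimal point and then keeps the first number found in the text. It is not equivalent to the manual mapping above, which makes case-by-case choices (for example taking 11 rather than 1 for the QA Engineer entry).

```python
import pandas as pd

# Made-up sample using the kinds of entries seen in the column.
raw = pd.Series(["5", "1,5", "15, thereof 8 as CTO", "less than year", "3.5"])

# Hypothetical rule: a comma between two digits is a decimal separator.
as_text = raw.str.replace(r"(?<=\d),(?=\d)", ".", regex=True)

# Keep the first number in each entry; entries without a number become NaN.
first_number = as_text.str.extract(r"(\d+(?:\.\d+)?)")[0]
years = pd.to_numeric(first_number, errors="coerce")

print(years)
```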


In [180]:
#dff2 = data_2020['Total years of experience'].dropna().unique()
#dff2

In [181]:
# year 2020 
groupe_2020 = data_2020.groupby('Total years of experience')
dd_20 = groupe_2020['Yearly brutto salary (without bonus and stocks) in EUR'].median()

In [182]:
dd_20.head(50)

Total years of experience
0        44000.0
0.5      16320.0
0.8      29750.0
1        48000.0
1.5      49925.0
10       75000.0
11       75000.0
12       75000.0
13       83000.0
14       80000.0
15       78000.0
16       83000.0
17       76000.0
18       85500.0
19       87000.0
2        50000.0
2.5      60000.0
20       80000.0
21       80000.0
22       70000.0
23      100000.0
24      130000.0
25       75000.0
26       93000.0
27       73500.0
28       78500.0
29       28800.0
3        56000.0
3.5      62500.0
30       64500.0
31      110000.0
38.3     70000.0
4        60000.0
4.5      67500.0
40       70000.0
5        65000.0
5.5      48000.0
6        67750.0
6.5      58000.0
7        70000.0
7.5      68000.0
8        68000.0
9        74000.0
Name: Yearly brutto salary (without bonus and stocks) in EUR, dtype: float64
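
The output above is ordered as text (`'1.5'`, `'10'`, `'11'`, ..., `'2'`) because the group keys are still strings. A minimal sketch on a hypothetical mini-sample (column names shortened for illustration) shows how converting to float before grouping gives a numerically ordered index:

```python
import pandas as pd

# Hypothetical mini-sample; the column names are shortened stand-ins
sample = pd.DataFrame({
    'experience': ['10', '2', '1.5', '2', '10'],
    'salary': [75000.0, 50000.0, 49925.0, 52000.0, 77000.0],
})

# Grouping on strings sorts the keys as text: '1.5', '10', '2'
as_text = sample.groupby('experience')['salary'].median()

# Converting first yields a numerically ordered index: 1.5, 2.0, 10.0
sample['experience'] = sample['experience'].astype(float)
as_number = sample.groupby('experience')['salary'].median()
print(as_number)
```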

In [183]:
d3 = pd.DataFrame(dd_20)
d3.head(43)

Unnamed: 0_level_0,Yearly brutto salary (without bonus and stocks) in EUR
Total years of experience,Unnamed: 1_level_1
0.0,44000.0
0.5,16320.0
0.8,29750.0
1.0,48000.0
1.5,49925.0
10.0,75000.0
11.0,75000.0
12.0,75000.0
13.0,83000.0
14.0,80000.0


In [184]:
d3.index = d3.index.astype(float)

In [185]:
d  = {'XP Years':d3.index, 'Yearly brutto salary (without bonus and stocks) in EUR': d3['Yearly brutto salary (without bonus and stocks) in EUR']}
d5 = pd.DataFrame(d)

In [186]:
#d4 = d3.copy()
d5 = d5.sort_values(by='XP Years', ascending=True)
d5.head(44)

Unnamed: 0_level_0,XP Years,Yearly brutto salary (without bonus and stocks) in EUR
Total years of experience,Unnamed: 1_level_1,Unnamed: 2_level_1
0.0,0.0,44000.0
0.5,0.5,16320.0
0.8,0.8,29750.0
1.0,1.0,48000.0
1.5,1.5,49925.0
2.0,2.0,50000.0
2.5,2.5,60000.0
3.0,3.0,56000.0
3.5,3.5,62500.0
4.0,4.0,60000.0


In [187]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=d2.index, y=d2['Yearly brutto salary (without bonus and stocks)'],
                    mode='lines',
                    name='2019 Median Yearly brutto through XP Years'))
fig.add_trace(go.Scatter(x=d5.index, y=d5['Yearly brutto salary (without bonus and stocks) in EUR'],
                    mode='lines',
                    name='2020 Median Yearly brutto through XP Years'))
fig.show()

**The median yearly brutto salary (without bonus and stocks) follows a similar trend across years of experience in 2019 and 2020, with some notable peaks and dips:**

* Year 2020:
  * Peaks: at 24 and 31 years of XP.
  * Dips: at 0.5, 5.5, 6.5, and 29 years of XP; from 38.3 years upward the median stays constant.

* Year 2019:
  * A peak at 28 years of XP.
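
Whether two yearly curves really move together can be checked numerically rather than by eye. The sketch below uses made-up medians (not the survey values) and assumes both series are indexed by years of experience; it aligns them on the shared experience values and computes the Pearson correlation.

```python
import pandas as pd

# Hypothetical medians (in EUR) indexed by years of experience
median_2019 = pd.Series([40000.0, 50000.0, 60000.0, 70000.0], index=[1.0, 2.0, 3.0, 4.0])
median_2020 = pd.Series([52000.0, 62000.0, 72000.0, 80000.0], index=[2.0, 3.0, 4.0, 5.0])

# Keep only the experience values present in both years
both = pd.concat([median_2019, median_2020], axis=1, keys=['2019', '2020']).dropna()

# Pearson correlation of the two median curves
r = both['2019'].corr(both['2020'])
print(len(both), round(r, 3))
```

With the notebook's own data, `d2` and `d5` would take the place of the two hypothetical series, provided both use numeric experience values as their index.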

In [188]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=d1.index, y=d1['Current Salary'],
                    mode='lines',
                    name='2018 Median Current Salary through XP Years'))
fig.add_trace(go.Scatter(x=d2.index, y=d2['Yearly brutto salary (without bonus and stocks)'],
                    mode='lines',
                    name='2019 Median Yearly brutto through XP Years'))
fig.add_trace(go.Scatter(x=d5.index, y=d5['Yearly brutto salary (without bonus and stocks) in EUR'],
                    mode='lines',
                    name='2020 Median Yearly brutto through XP Years'))
fig.show()

**The same holds across all three years: 2018, 2019, and 2020.**

**Current Salary (2018) and Yearly brutto salary without bonus and stocks (2019 and 2020) follow a similar trend across years of experience, with peaks and dips in every year:**

* Year 2020:
  * Peaks: at 24 and 31 years of XP.
  * Dips: at `0.5`, `5.5`, `6.5`, and `29` years of XP; from 38.3 years upward the median stays constant.

* Year 2019:
  * Peaks: at `0`, `4`, `6`, `8` through `12`, `21`, `22`, `23`, and `28` years of XP.
  * Dips: at `13`, `16`, and `19` years of XP.

* Year 2018:
  * Peaks: at `1.5`, `16`, and `25` years of XP.
  * Dips: at `4.5`, `7.5`, `9`, `15`, `18`, and `38` years of XP.

In [189]:
languages_value = data_2020['Other technologies/programming languages you use often'].value_counts()
vals = data_2020['Other technologies/programming languages you use often'].dropna().unique()
lgval = languages_value.unique()

In [190]:
langtools = pd.DataFrame(data=languages_value,columns=['Popularity'],index=vals)

In [191]:
languages_value

Javascript / Typescript                                                         201
Python                                                                           37
SQL                                                                              31
AWS, Docker                                                                      16
Kotlin                                                                           15
                                                                               ... 
PHP, Javascript / Typescript, Java / Scala, SQL, AWS, Docker                      1
Java / Scala, SQL, AWS, Azure                                                     1
Python, Kotlin, Java / Scala, SQL, Go, AWS, Google Cloud, Kubernetes, Docker      1
Python, Javascript / Typescript, SQL, Kubernetes, Docker                          1
Python, R                                                                         1
Name: Other technologies/programming languages you use often, Length: 562, d

In [192]:
langtools.sort_values(by='Popularity', ascending=False)

Unnamed: 0,Popularity
"Kotlin, Javascript / Typescript",
Javascript / Typescript,
"Javascript / Typescript, Docker",
".NET, SQL, AWS, Docker",
"Python, AWS, Google Cloud, Kubernetes, Docker",
...,...
"Javascript / Typescript, Java / Scala, SQL, Go, AWS, Docker",
"Javascript / Typescript, SQL, Go, AWS, Google Cloud, Azure, Kubernetes",
"Python, C/C++, Javascript / Typescript, Java / Scala, Perl, AWS, Docker, Networking, Data Center",
"Javascript / Typescript, Docker, HTML, CSS; Adobe XD",


In [193]:
langtools['Popularity'] = languages_value
langtools = langtools.sort_values(by='Popularity', ascending=False)

In [194]:
fig = px.bar(langtools, y="Popularity", x=langtools.index, title="Popular Programming Languages in 2020",
             color_continuous_scale='RdBu', color="Popularity", height=1200)
fig.update_xaxes(tickangle=45)
fig.show()

**From the graph, the four most frequent answers are:**

* **JavaScript / TypeScript**
* **Python**
* **SQL**
* **AWS, Docker** (a combined answer naming two tools rather than a programming language)
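
Because each respondent can list several technologies in one cell, `value_counts()` on the raw column counts exact answer combinations rather than individual technologies. A minimal sketch, using a hypothetical `answers` series in place of the survey column, splits each answer on commas and counts every technology separately:

```python
import pandas as pd

# Hypothetical answers in the same comma-separated style as the survey column
answers = pd.Series([
    'Python, SQL',
    'Javascript / Typescript',
    'AWS, Docker',
    'Python, AWS, Docker',
    None,
])

# One row per technology: split on commas, explode, trim spaces, then count
tech_counts = (
    answers.dropna()
    .str.split(',')
    .explode()
    .str.strip()
    .value_counts()
)
print(tech_counts)
```

Counted this way, a tool such as Docker is credited for every answer that mentions it, not only for answers that match a specific combination.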

In [195]:
# Pie chart for Gender...

gender_count  = data_2020['Gender'].value_counts()
gender_count_un = data_2020['Gender'].dropna().unique()


In [196]:
gender_count

Male       1059
Female      192
Diverse       2
Name: Gender, dtype: int64

In [197]:
gender_count_un

array(['Male', 'Female', 'Diverse'], dtype=object)

In [198]:
# Look up each count by its label so the values line up with gender_count_un
genders = [gender_count[g] for g in gender_count_un]

In [199]:
df_gender = pd.DataFrame(genders,columns=["Gender_Ratio"], index= gender_count_un)

In [200]:
fig = px.pie(df_gender, values='Gender_Ratio', names=df_gender.index, title='Genders in the year 2020')
fig.show()
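
The same table can be built in one step from `value_counts()`, which keeps labels and counts paired by construction. This is a sketch on a hypothetical `gender` series; the `Share_%` column is an addition for illustration.

```python
import pandas as pd

# Hypothetical answers standing in for the survey's Gender column
gender = pd.Series(['Male', 'Female', 'Male', 'Diverse', 'Male', None])

# value_counts() keeps each label attached to its count
counts = gender.value_counts()
df_gender = counts.rename_axis('Gender').reset_index(name='Gender_Ratio')

# Optional: percentage share of each category
df_gender['Share_%'] = (100 * df_gender['Gender_Ratio'] / df_gender['Gender_Ratio'].sum()).round(1)
print(df_gender)
```

A frame shaped like this could be passed to `px.pie(df_gender, values='Gender_Ratio', names='Gender')`.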


## Bonus Section [Optional but carries bonus marks]
This dataset is as raw and real as it can get while conducting yearly surveys. You might have observed that the data is not clean and structured and requires some thorough cleaning before deriving meaningful plots. When combined with the power of Plotly and Dash, there are endless possibilities for the insightful visualizations you can create. 

This section is to let you experiment, explore and create as many visualizations as you’d like. You never know, if we like the creativity and the extra work, you might receive some bonus marks!


In [201]:
fig = px.line(d1, x=d1.index, y='Current Salary', color="Current Salary", title='Median Salary by Years of XP, Colored by Current Salary (2018)')
fig.show()

In [202]:
fig = px.line(data_2018, x="Years of experience", y='Current Salary', color="Current Salary", title='Salary by Years of XP (2018)')
fig.show()

# Conclusion
This brings us to the end of the assignment and to the bootcamp. We hope you had a great learning time. :)

Now, you can submit your notebook for assessment. 

In [203]:
data_2018.columns

Index(['Age', 'Gender', 'City', 'Position', 'Years of experience',
       'Seniority level', 'Current Salary', 'Salary one year ago',
       'Salary two years ago', 'Are you getting any Stock Options?',
       'Main language at work', 'Company size', 'Company type'],
      dtype='object')

In [204]:
data_2018['Year'] = 2018

In [205]:
data_2019['Year'] = 2019 
data_2020['Year'] = 2020

In [206]:
#df =pd.concat([data_2018, data_2019,data_2020])

In [207]:
data_2018.columns

Index(['Age', 'Gender', 'City', 'Position', 'Years of experience',
       'Seniority level', 'Current Salary', 'Salary one year ago',
       'Salary two years ago', 'Are you getting any Stock Options?',
       'Main language at work', 'Company size', 'Company type', 'Year'],
      dtype='object')

In [208]:
data_2019.columns

Index(['Age', 'Gender', 'City', 'Seniority level', 'Position',
       'Years of experience', 'Your main technology / programming language',
       'Yearly brutto salary (without bonus and stocks)', 'Yearly bonus',
       'Yearly stocks',
       'Yearly brutto salary (without bonus and stocks) one year ago. Only answer if staying in same country',
       'Yearly bonus one year ago. Only answer if staying in same country',
       'Yearly stocks one year ago. Only answer if staying in same country',
       'Number of vacation days', 'Number of home office days per month',
       'Main language at work', 'Company name', 'Company size', 'Company type',
       'Сontract duration', 'Company business sector', 'Year'],
      dtype='object')

In [209]:
data_2020.columns

Index(['Age', 'Gender', 'City', 'Position', 'Total years of experience',
       'Years of experience in Germany', 'Seniority level',
       'Your main technology / programming language',
       'Other technologies/programming languages you use often',
       'Yearly brutto salary (without bonus and stocks) in EUR',
       'Yearly bonus + stocks in EUR',
       'Annual brutto salary (without bonus and stocks) one year ago. Only answer if staying in the same country',
       'Annual bonus+stocks one year ago. Only answer if staying in same country',
       'Number of vacation days', 'Employment status', 'Сontract duration',
       'Main language at work', 'Company size', 'Company type',
       'Have you lost your job due to the coronavirus outbreak?',
       'Have you been forced to have a shorter working week (Kurzarbeit)? If yes, how many hours per week',
       'Have you received additional monetary support from your employer due to Work From Home? If yes, how much in 2020 in EUR',
  

In [210]:
df_18 = data_2018.copy()
df_19 = data_2019.copy()
df_20 = data_2020.copy()

In [211]:
df_18.rename(columns={'Current Salary':'Salary_in_EUR'},inplace=True)
df_19.rename(columns={'Yearly brutto salary (without bonus and stocks)':'Salary_in_EUR'},inplace=True)
df_20.rename(columns={'Yearly brutto salary (without bonus and stocks) in EUR':'Salary_in_EUR','Total years of experience':'Years of experience'},inplace=True)


In [212]:
df_19.columns

Index(['Age', 'Gender', 'City', 'Seniority level', 'Position',
       'Years of experience', 'Your main technology / programming language',
       'Salary_in_EUR', 'Yearly bonus', 'Yearly stocks',
       'Yearly brutto salary (without bonus and stocks) one year ago. Only answer if staying in same country',
       'Yearly bonus one year ago. Only answer if staying in same country',
       'Yearly stocks one year ago. Only answer if staying in same country',
       'Number of vacation days', 'Number of home office days per month',
       'Main language at work', 'Company name', 'Company size', 'Company type',
       'Сontract duration', 'Company business sector', 'Year'],
      dtype='object')

In [213]:
cols_18_19_20 = ['Age', 
'Gender', 
'City', 
'Position', 
'Years of experience',
'Seniority level', 
'Salary_in_EUR',
'Main language at work', 
'Company size', 
'Company type', 
'Year']

In [214]:
# Keep only the columns above to build the visualization with Dash
df_sur_18 = df_18[cols_18_19_20]
df_sur_19 = df_19[cols_18_19_20]
df_sur_20 = df_20[cols_18_19_20]
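
The shared column list was written by hand above. As a cross-check, it could also be derived from the frames themselves. The sketch below uses three hypothetical empty frames with shortened column sets; after the renames, the notebook's `df_18`, `df_19`, and `df_20` would take their place.

```python
import pandas as pd

# Hypothetical frames with only a few columns each, standing in for the three surveys
frame_a = pd.DataFrame(columns=['Age', 'Gender', 'Salary_in_EUR', 'Year'])
frame_b = pd.DataFrame(columns=['Age', 'Gender', 'Company name', 'Salary_in_EUR', 'Year'])
frame_c = pd.DataFrame(columns=['Age', 'Gender', 'Salary_in_EUR', 'Employment status', 'Year'])
frames = [frame_a, frame_b, frame_c]

# Columns of the first frame that every other frame also has, in the first frame's order
shared = [col for col in frames[0].columns if all(col in f.columns for f in frames[1:])]
print(shared)
```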

In [215]:
df_sur_18.head()

Unnamed: 0,Age,Gender,City,Position,Years of experience,Seniority level,Salary_in_EUR,Main language at work,Company size,Company type,Year
0,43.0,M,München,QA Ingenieur,11.0,Senior,77000.0,Deutsch,100-1000,Product,2018
1,33.0,F,München,Senior PHP Magento developer,8.0,Senior,65000.0,Deutsch,50-100,Product,2018
2,32.0,M,München,Software Engineer,10.0,Senior,88000.0,Deutsch,1000+,Product,2018
3,25.0,M,München,Senior Frontend Developer,6.0,Senior,78000.0,English,1000+,Product,2018
4,39.0,M,München,UX Designer,10.0,Senior,69000.0,English,100-1000,Ecom retailer,2018


In [216]:
df_sur_19.head()

Unnamed: 0,Age,Gender,City,Position,Years of experience,Seniority level,Salary_in_EUR,Main language at work,Company size,Company type,Year
0,33.0,Male,Berlin,Fullstack Developer,13.0,Senior,64000.0,English,50-100,Startup,2019
1,29.0,Male,Berlin,Backend Developer,3.0,Middle,55000.0,English,10-50,Product,2019
2,32.4161,Male,Berlin,Mobile Developer,4.0,Middle,70000.0,English,1000+,Startup,2019
3,30.0,Male,Berlin,Backend Developer,6.0,Senior,63000.0,English,100-1000,Product,2019
4,32.0,Male,Berlin,Embedded Developer,10.0,Senior,66000.0,English,50-100,Product,2019


In [217]:
df_sur_20.head()

Unnamed: 0,Age,Gender,City,Position,Years of experience,Seniority level,Salary_in_EUR,Main language at work,Company size,Company type,Year
0,26.0,Male,Munich,Software Engineer,5,Senior,80000.0,English,51-100,Product,2020
1,26.0,Male,Berlin,Backend Developer,7,Senior,80000.0,English,101-1000,Product,2020
2,29.0,Male,Berlin,Software Engineer,12,Lead,120000.0,English,101-1000,Product,2020
3,28.0,Male,Berlin,Frontend Developer,4,Junior,54000.0,English,51-100,Startup,2020
4,37.0,Male,Berlin,Backend Developer,17,Senior,62000.0,English,101-1000,Product,2020


In [218]:
survey_18_20 =pd.concat([df_sur_18, df_sur_19,df_sur_20])
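
By default `pd.concat` keeps each frame's original row labels, so the combined index repeats values such as 0, 1, 2 once per year. A small sketch with two hypothetical frames shows the difference that `ignore_index=True` makes:

```python
import pandas as pd

# Two hypothetical yearly frames, each with its own 0-based index
year_a = pd.DataFrame({'Year': [2018, 2018]})
year_b = pd.DataFrame({'Year': [2019, 2019, 2019]})

# Default: row labels are carried over, so they repeat
stacked = pd.concat([year_a, year_b])

# ignore_index=True renumbers the rows 0..n-1
renumbered = pd.concat([year_a, year_b], ignore_index=True)
print(stacked.index.is_unique, renumbered.index.is_unique)
```

Repeated labels are harmless for the aggregations in this notebook, but label-based selection with `.loc` would return several rows for one label.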

In [219]:
survey_18_20

Unnamed: 0,Age,Gender,City,Position,Years of experience,Seniority level,Salary_in_EUR,Main language at work,Company size,Company type,Year
0,43.0,M,München,QA Ingenieur,11,Senior,77000.0,Deutsch,100-1000,Product,2018
1,33.0,F,München,Senior PHP Magento developer,8,Senior,65000.0,Deutsch,50-100,Product,2018
2,32.0,M,München,Software Engineer,10,Senior,88000.0,Deutsch,1000+,Product,2018
3,25.0,M,München,Senior Frontend Developer,6,Senior,78000.0,English,1000+,Product,2018
4,39.0,M,München,UX Designer,10,Senior,69000.0,English,100-1000,Ecom retailer,2018
...,...,...,...,...,...,...,...,...,...,...,...
1248,31.0,Male,Berlin,Backend Developer,9,Senior,70000.0,English,51-100,Product,2020
1249,33.0,Male,Berlin,Researcher/ Consumer Insights Analyst,10,Senior,60000.0,English,1000+,Product,2020
1250,39.0,Male,Munich,IT Operations Manager,15,Lead,110000.0,English,101-1000,eCommerce,2020
1251,26.0,Male,Saarbrücken,Frontend Developer,7,Middle,38350.0,German,101-1000,Product,2020


In [220]:
survey_18_20.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3009 entries, 0 to 1252
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    3009 non-null   float64
 1   Gender                 3009 non-null   object 
 2   City                   3009 non-null   object 
 3   Position               3009 non-null   object 
 4   Years of experience    3009 non-null   object 
 5   Seniority level        3009 non-null   object 
 6   Salary_in_EUR          3009 non-null   float64
 7   Main language at work  3009 non-null   object 
 8   Company size           3009 non-null   object 
 9   Company type           3009 non-null   object 
 10  Year                   3009 non-null   int64  
dtypes: float64(2), int64(1), object(8)
memory usage: 282.1+ KB


In [221]:
survey_18_20['Gender'].value_counts()

Male       1897
M           660
Female      345
F           105
Diverse       2
Name: Gender, dtype: int64

In [222]:
survey_18_20['Gender'] = survey_18_20['Gender'].replace({'M': 'Male', 'F': 'Female'})

In [223]:
survey_18_20['Gender'].value_counts()

Male       2557
Female      450
Diverse       2
Name: Gender, dtype: int64

In [224]:
survey_18_20['Age'] = survey_18_20['Age'].astype(int)
survey_18_20['Age'].value_counts()

32    459
30    274
33    231
31    218
29    187
28    181
35    181
34    171
27    136
36    126
26    119
37    106
38     95
25     95
40     64
24     60
39     55
42     43
43     28
41     28
23     25
45     24
44     19
22     18
46     13
48      9
47      8
21      6
52      5
50      5
49      5
54      3
51      2
56      2
20      2
69      1
60      1
65      1
53      1
59      1
66      1
Name: Age, dtype: int64

In [225]:
survey_18_20.columns

Index(['Age', 'Gender', 'City', 'Position', 'Years of experience',
       'Seniority level', 'Salary_in_EUR', 'Main language at work',
       'Company size', 'Company type', 'Year'],
      dtype='object')

In [226]:
survey_18_20.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3009 entries, 0 to 1252
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Age                    3009 non-null   int32  
 1   Gender                 3009 non-null   object 
 2   City                   3009 non-null   object 
 3   Position               3009 non-null   object 
 4   Years of experience    3009 non-null   object 
 5   Seniority level        3009 non-null   object 
 6   Salary_in_EUR          3009 non-null   float64
 7   Main language at work  3009 non-null   object 
 8   Company size           3009 non-null   object 
 9   Company type           3009 non-null   object 
 10  Year                   3009 non-null   int64  
dtypes: float64(1), int32(1), int64(1), object(8)
memory usage: 270.3+ KB


In [227]:
survey_18_20['City'] = survey_18_20['City'].str.upper()
survey_18_20['City'].value_counts()

BERLIN            1431
MUNICH             476
MÜNCHEN            249
FRANKFURT          127
AMSTERDAM          104
                  ... 
WÜRZBURG             1
MEMMINGEN            1
WARSAW, POLAND       1
FRANKONIA            1
CRACOVIA             1
Name: City, Length: 180, dtype: int64
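
Upper-casing alone leaves spelling variants of one city as separate categories (for example `MUNICH` and `MÜNCHEN` in the output above). One possible follow-up is an alias table, sketched here on a hypothetical `city` series. The entries in `city_aliases` are assumptions for illustration and would need to be extended after inspecting the full list of 180 values.

```python
import pandas as pd

# Hypothetical city answers with spelling variants
city = pd.Series(['München', 'Munich', 'Berlin', 'Warsaw, Poland', 'berlin'])

# Assumed alias table: variant spelling -> canonical name
city_aliases = {'MÜNCHEN': 'MUNICH', 'WARSAW, POLAND': 'WARSAW'}

clean_city = city.str.upper().replace(city_aliases)
print(clean_city.value_counts())
```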

In [228]:
survey_18_20['Position'] = survey_18_20['Position'].str.upper()
survey_18_20['Position'].value_counts()

BACKEND DEVELOPER                434
SOFTWARE ENGINEER                433
DATA SCIENTIST                   248
FRONTEND DEVELOPER               175
DEVOPS                           126
                                ... 
IT CONSULTING                      1
SALES OPS TEAM LEAD                1
SOFTWARE DEVELOPMENT ENGINEER      1
BUSINESS ANALYST/RE                1
SEM MANAGER                        1
Name: Position, Length: 452, dtype: int64

In [243]:
survey_18_20['Years of experience'] = survey_18_20['Years of experience'].astype(float).round(1)
survey_18_20['Years of experience'].value_counts()

10.0    367
5.0     300
8.0     240
6.0     223
7.0     219
4.0     183
3.0     174
12.0    164
15.0    154
9.0     151
2.0     129
11.0    126
1.0      93
13.0     85
14.0     72
20.0     64
16.0     47
18.0     44
8.5      33
17.0     28
19.0     18
25.0     15
0.0      15
2.5      10
22.0      9
30.0      7
1.5       6
21.0      4
0.5       3
4.5       3
28.0      3
7.5       2
24.0      2
23.0      2
27.0      2
3.5       2
0.8       2
38.0      1
38.3      1
40.0      1
26.0      1
5.5       1
6.5       1
29.0      1
31.0      1
Name: Years of experience, dtype: int64

In [232]:
survey_18_20['Seniority level'] = survey_18_20['Seniority level'].str.upper()
survey_18_20['Seniority level'].value_counts()

SENIOR                                       1698
MIDDLE                                        844
LEAD                                          201
JUNIOR                                        192
HEAD                                           50
PRINCIPAL                                       6
INTERN                                          2
STUDENT                                         2
SELF EMPLOYED                                   1
VP                                              1
C-LEVEL EXECUTIVE MANAGER                       1
DIRECTOR                                        1
WORKING STUDENT                                 1
MANAGER                                         1
NO IDEA, THERE ARE NO RANGES IN THE FIRM        1
ENTRY LEVEL                                     1
WORK CENTER MANAGER                             1
NO LEVEL                                        1
KEY                                             1
NO LEVEL                                        1
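
`NO LEVEL` appears twice in the output above, which suggests the two entries differ in something invisible, possibly a trailing space (this is an assumption, not verified against the data). A minimal sketch on a hypothetical series shows how stripping whitespace before upper-casing merges such variants:

```python
import pandas as pd

# Hypothetical answers; the third one carries a trailing space
level = pd.Series(['Senior', 'No level', 'No level ', 'senior'])

upper_only = level.str.upper()
stripped = level.str.strip().str.upper()
print(upper_only.nunique(), stripped.nunique())
```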


In [254]:
survey_18_20['Salary_in_EUR'] = survey_18_20['Salary_in_EUR'].astype(float).round(1)
survey_18_20['Salary_in_EUR'].value_counts()

60000.0        211
65000.0        204
70000.0        196
75000.0        169
80000.0        131
              ... 
500000000.0      1
74200.0          1
6000.0           1
11000.0          1
95500.0          1
Name: Salary_in_EUR, Length: 307, dtype: int64

In [255]:
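# Drop implausible salaries above 200,000 EUR; .copy() avoids SettingWithCopyWarning on later assignments
survey_18_20 = survey_18_20[survey_18_20['Salary_in_EUR'] <= 200000.0].copy()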
survey_18_20['Salary_in_EUR'].value_counts()

60000.0    211
65000.0    204
70000.0    196
75000.0    169
80000.0    131
          ... 
6000.0       1
11000.0      1
33200.0      1
39400.0      1
27977.0      1
Name: Salary_in_EUR, Length: 299, dtype: int64
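
The 200,000 EUR cut-off above is a fixed threshold. A data-driven alternative, shown here only as a sketch on hypothetical salaries, is the common 1.5 × IQR rule; it is a different technique from the one used in this notebook and may keep or drop different rows on the real data.

```python
import pandas as pd

# Hypothetical salaries including one obviously wrong entry
df = pd.DataFrame({'Salary_in_EUR': [50000.0, 60000.0, 65000.0, 70000.0, 80000.0, 500000000.0]})

# Interquartile range and the usual 1.5 * IQR fences
q1 = df['Salary_in_EUR'].quantile(0.25)
q3 = df['Salary_in_EUR'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# .copy() makes the result an independent frame, safe for later column assignments
kept = df[df['Salary_in_EUR'].between(lower, upper)].copy()
print(len(kept), upper)
```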

In [256]:
survey_18_20['Main language at work'].value_counts()

English                                  2382
Deutsch                                   315
German                                    186
Russian                                    78
French                                      8
Polish                                      6
Italian                                     5
Spanish                                     4
Czech                                       2
English and German                          2
Русский                                     2
Polish+English                              1
Russian, English                            1
Team - Russian; Cross-team - English;       1
Ukrainian                                   1
English+Deutsch                             1
both                                        1
Deutsch/Englisch                            1
50/50                                       1
Deuglisch                                   1
Dutch                                       1
Name: Main language at work, dtype

In [260]:
survey_18_20['Main language at work'] = survey_18_20['Main language at work'].replace(['Russian, English','Team - Russian; Cross-team - English;'],'Russian_&_English')
survey_18_20['Main language at work'] = survey_18_20['Main language at work'].replace(['English+Deutsch','Deutsch/Englisch','Deuglisch'],'Deutsch_&_English')
survey_18_20['Main language at work'] = survey_18_20['Main language at work'].replace(['English and German','50/50','both'],'German_&_English')
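# Note: Dutch and Deutsch (German) are different languages; this folds the single Dutch answer into Deutsch
survey_18_20['Main language at work'] = survey_18_20['Main language at work'].replace('Dutch','Deutsch')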
survey_18_20['Main language at work'] = survey_18_20['Main language at work'].replace('Polish+English','Polish_&_English')




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy




In [261]:
survey_18_20['Main language at work'].value_counts()

English              2382
Deutsch               316
German                186
Russian                78
French                  8
Polish                  6
Italian                 5
Spanish                 4
German_&_English        4
Deutsch_&_English       3
Czech                   2
Русский                 2
Russian_&_English       2
Ukrainian               1
Polish_&_English        1
Name: Main language at work, dtype: int64
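
After this step, `Deutsch` and `German` still count as separate categories, as do `Русский` and `Russian`. A possible further normalization is sketched below on a hypothetical series; the `canonical` table is an assumption for illustration, and it leaves `Dutch` as its own language.

```python
import pandas as pd

# Hypothetical answers covering the variants seen in the output above
lang = pd.Series([
    'English', 'Deutsch', 'German', 'Русский', 'Russian',
    'Dutch', 'Deutsch/Englisch', 'English and German',
])

# Assumed table mapping each variant to one canonical label
canonical = {
    'Deutsch': 'German',
    'Русский': 'Russian',
    'Deutsch/Englisch': 'German_&_English',
    'English and German': 'German_&_English',
}

clean_lang = lang.replace(canonical)
print(clean_lang.value_counts())
```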

In [262]:
survey_18_20['Company size'].value_counts()

1000+       1026
100-1000     654
101-1000     403
50-100       252
10-50        222
11-50        174
51-100       147
up to 10     122
Name: Company size, dtype: int64
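
The size buckets differ slightly between survey years (`100-1000` vs `101-1000`, `50-100` vs `51-100`, `10-50` vs `11-50`). Assuming each pair is meant as the same bucket, they could be merged and given a meaningful order. The sketch below uses a hypothetical `size` series; `size_aliases` and `order` are illustrative names.

```python
import pandas as pd

# Hypothetical answers, one per label seen in the output above
size = pd.Series(['1000+', '100-1000', '101-1000', '50-100', '51-100', '10-50', '11-50', 'up to 10'])

# Assumption: the older labels describe the same buckets as the newer ones
size_aliases = {'100-1000': '101-1000', '50-100': '51-100', '10-50': '11-50'}
order = ['up to 10', '11-50', '51-100', '101-1000', '1000+']

# An ordered categorical lets tables and plots sort from small to large
clean_size = pd.Series(pd.Categorical(size.replace(size_aliases), categories=order, ordered=True))
counts = clean_size.value_counts().sort_index()
print(counts)
```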

In [266]:
survey_18_20['Company type']= survey_18_20['Company type'].str.upper()
survey_18_20['Company type'].value_counts()



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



PRODUCT                               1914
STARTUP                                576
CONSULTING / AGENCY                    259
AGENCY                                  74
BODYSHOP / OUTSOURCE                    30
                                      ... 
PROJECT-BASED SOFTWARE DEVELOPMENT       1
SYSTEMHAUS                               1
SCIENCE INSTITUTE                        1
IT CONSULTANCY                           1
FREELANCE                                1
Name: Company type, Length: 96, dtype: int64

In [276]:
survey_18_20.to_csv('survey_18_20.csv')
print('successfully saved')

successfully saved
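
Since `to_csv` is called without `index=False`, the row index is written as an unnamed first column. The sketch below is a round trip through an in-memory buffer with a hypothetical two-row frame, showing how that column appears on reading and how `index_col=0` restores it as the index. How the Dash script actually reads the file is not shown here.

```python
import io
import pandas as pd

# Hypothetical frame with repeated row labels, as produced by a plain concat
df = pd.DataFrame({'Year': [2018, 2019]}, index=[0, 0])

# Write to an in-memory buffer instead of a file
buffer = io.StringIO()
df.to_csv(buffer)

# Default read: the saved index shows up as a column named 'Unnamed: 0'
buffer.seek(0)
default_read = pd.read_csv(buffer)

# index_col=0 turns that first column back into the index
buffer.seek(0)
with_index = pd.read_csv(buffer, index_col=0)
print(list(default_read.columns), list(with_index.columns))
```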


# To launch the app, see the script:

* Script of the Dash app dashboard:
    
    **dash_app_IT_Sal_Survey_Ayoub_Berdeddouch.py**

Run it from the command line: `python dash_app_IT_Sal_Survey_Ayoub_Berdeddouch.py`