# Overview

The main aim of this project is to perform some data cleaning in the [JBeans Raw Dataset](https://drive.google.com/drive/folders/1giaGOhJYWIXfZzy-zxmbrUIOB9Y_ROdu) and get it ready for cleaning 

The main aim of this project is to practice some of the skills I have learnt in data cleaning and data analysis. 


#### Requirements

The directory of this notebook contains a CSV file which contains all the raw data in the Jeans Raw Dataset. If the CSV file is missing you can download it [here](https://drive.google.com/drive/folders/1giaGOhJYWIXfZzy-zxmbrUIOB9Y_ROdu).

The following assumptions have been made:
- The CSV file is in the current directory
- The CSV file is named raw_survey_data.csv




In [1]:
# if raw_survey_data.csv is in the current directory it would be outputed when you run 
# this cell 

!ls | grep raw_survey_data.csv$

raw_survey_data.csv


In [364]:
# importing packages with their natural alias

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math


In [3]:

# Here we store all the constants we need in this project
CSV_FILENAME = 'raw_survey_data.csv'


In [5]:
# Loading the CSV to a dataframe
raw_data = pd.read_csv(CSV_FILENAME, dtype=np.object)


In [6]:
# Performing basic checks on the data 
raw_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19835 entries, 0 to 19834
Columns: 284 entries, Unnamed: 0 to What country do you live in?
dtypes: object(284)
memory usage: 43.0+ MB


In [7]:
raw_data.columns

Index(['Unnamed: 0',
       'Is Python the main language you use for your current projects?',
       'None:What other language(s) do you use?',
       'Java:What other language(s) do you use?',
       'JavaScript:What other language(s) do you use?',
       'C/C++:What other language(s) do you use?',
       'PHP:What other language(s) do you use?',
       'C#:What other language(s) do you use?',
       'Ruby:What other language(s) do you use?',
       'Bash / Shell:What other language(s) do you use?',
       ...
       'Technical support:Which of the following best describes your job role(s)?',
       'Data analyst:Which of the following best describes your job role(s)?',
       'Business analyst:Which of the following best describes your job role(s)?',
       'Team lead:Which of the following best describes your job role(s)?',
       'Product manager:Which of the following best describes your job role(s)?',
       'CIO / CEO / CTO:Which of the following best describes your job role(s)?

In [8]:

raw_data.shape

(19835, 284)

In [9]:
raw_data.head()

Unnamed: 0.1,Unnamed: 0,Is Python the main language you use for your current projects?,None:What other language(s) do you use?,Java:What other language(s) do you use?,JavaScript:What other language(s) do you use?,C/C++:What other language(s) do you use?,PHP:What other language(s) do you use?,C#:What other language(s) do you use?,Ruby:What other language(s) do you use?,Bash / Shell:What other language(s) do you use?,...,Technical support:Which of the following best describes your job role(s)?,Data analyst:Which of the following best describes your job role(s)?,Business analyst:Which of the following best describes your job role(s)?,Team lead:Which of the following best describes your job role(s)?,Product manager:Which of the following best describes your job role(s)?,CIO / CEO / CTO:Which of the following best describes your job role(s)?,Systems analyst:Which of the following best describes your job role(s)?,Other Write In::Which of the following best describes your job role(s)?,Could you tell us your age range?,What country do you live in?
0,1,"No, I don’t use Python for my current projects",,,,,,,,,...,,,,,,,,,50–59,Antigua and Barbuda
1,2,"No, I don’t use Python for my current projects",,,JavaScript,,PHP,,,Bash / Shell,...,,,,,,,,,30–39,Italy
2,3,"No, I don’t use Python for my current projects",,,JavaScript,,,,,Bash / Shell,...,,,,,,,,,50–59,United States
3,4,"No, I don’t use Python for my current projects",,Java,JavaScript,,,,,Bash / Shell,...,,,,,,,,,30–39,United Kingdom
4,5,"No, I don’t use Python for my current projects",,Java,,,,,,,...,,,,,,,,,21–29,United States


In [10]:
# List of all columns that were asked

for index, columns in enumerate(raw_data.columns):
    print('#{}: {}'.format(index+1,  columns))
    

#1: Unnamed: 0
#2: Is Python the main language you use for your current projects?
#3: None:What other language(s) do you use?
#4: Java:What other language(s) do you use?
#5: JavaScript:What other language(s) do you use?
#6: C/C++:What other language(s) do you use?
#7: PHP:What other language(s) do you use?
#8: C#:What other language(s) do you use?
#9: Ruby:What other language(s) do you use?
#10: Bash / Shell:What other language(s) do you use?
#11: Objective-C:What other language(s) do you use?
#12: Go:What other language(s) do you use?
#13: Visual Basic:What other language(s) do you use?
#14: Scala:What other language(s) do you use?
#15: SQL:What other language(s) do you use?
#16: Kotlin:What other language(s) do you use?
#17: R:What other language(s) do you use?
#18: Swift:What other language(s) do you use?
#19: Clojure:What other language(s) do you use?
#20: Perl:What other language(s) do you use?
#21: Rust:What other language(s) do you use?
#22: Groovy:What other language(s) do you 

###  Characteristics of the Dataset
We see from the above checks that the:

- dataset has `19,835` rows and `284` columns
- questions asked were placed in columns
- each option in a multi-choice question is stored as a new column in the dataset
- missing values are stored as `NaN`
- once the user says they dont use python as their main language for the first question then the only relevant question for that user is the last 2 namely:

>Could you tell us your age range?

**AND**

>What country do you live in?



### Our goal

Looking at how the data is structured, I think that the best way to structure this data for analysis is to extract the answers of each question and store them in different sheets in an excel file with each sheet having the name of the excel files. 

This, I believe would make it easy to analyze and process the data. 

In summary, we have to do the following:
- each question and extract their answers
- clean data extracted and get it ready for analysis
- export cleaned data into an excel sheet whose sheet name is the question
- save the data to a file

The final output of this notebook therefore is the excel_file. Let us start by creating 2 subsets of the origin dataframe that contains:
- Answers to questions when the user uses python
- Answers to questions when the user does not use python




### Question 1

The first question is what we would use to create two observation subsets

In [117]:
# observations for python files that do not have
NOT_PYTHON_STR = 'No, I don’t use Python for my current projects'

In [118]:
# extracting the observations into two dataframes
NON_PYTHON_OBSERVATIONS_DF = raw_data[raw_data.iloc[:, 1] == NOT_PYTHON_STR]
PYTHON_OBSERVATIONS_DF = raw_data[raw_data.iloc[:, 1] != NOT_PYTHON_STR]


In [119]:
print(NON_PYTHON_OBSERVATIONS_DF.iloc[:, 1].value_counts())
NON_PYTHON_OBSERVATIONS_DF.head()

No, I don’t use Python for my current projects    1404
Name: Is Python the main language you use for your current projects?, dtype: int64


Unnamed: 0.1,Unnamed: 0,Is Python the main language you use for your current projects?,None:What other language(s) do you use?,Java:What other language(s) do you use?,JavaScript:What other language(s) do you use?,C/C++:What other language(s) do you use?,PHP:What other language(s) do you use?,C#:What other language(s) do you use?,Ruby:What other language(s) do you use?,Bash / Shell:What other language(s) do you use?,...,Technical support:Which of the following best describes your job role(s)?,Data analyst:Which of the following best describes your job role(s)?,Business analyst:Which of the following best describes your job role(s)?,Team lead:Which of the following best describes your job role(s)?,Product manager:Which of the following best describes your job role(s)?,CIO / CEO / CTO:Which of the following best describes your job role(s)?,Systems analyst:Which of the following best describes your job role(s)?,Other Write In::Which of the following best describes your job role(s)?,Could you tell us your age range?,What country do you live in?
0,1,"No, I don’t use Python for my current projects",,,,,,,,,...,,,,,,,,,50–59,Antigua and Barbuda
1,2,"No, I don’t use Python for my current projects",,,JavaScript,,PHP,,,Bash / Shell,...,,,,,,,,,30–39,Italy
2,3,"No, I don’t use Python for my current projects",,,JavaScript,,,,,Bash / Shell,...,,,,,,,,,50–59,United States
3,4,"No, I don’t use Python for my current projects",,Java,JavaScript,,,,,Bash / Shell,...,,,,,,,,,30–39,United Kingdom
4,5,"No, I don’t use Python for my current projects",,Java,,,,,,,...,,,,,,,,,21–29,United States


In [120]:
print(PYTHON_OBSERVATIONS_DF.iloc[:, 1].value_counts())
PYTHON_OBSERVATIONS_DF.head()

Yes                                         15404
No, I use Python as a secondary language     3027
Name: Is Python the main language you use for your current projects?, dtype: int64


Unnamed: 0.1,Unnamed: 0,Is Python the main language you use for your current projects?,None:What other language(s) do you use?,Java:What other language(s) do you use?,JavaScript:What other language(s) do you use?,C/C++:What other language(s) do you use?,PHP:What other language(s) do you use?,C#:What other language(s) do you use?,Ruby:What other language(s) do you use?,Bash / Shell:What other language(s) do you use?,...,Technical support:Which of the following best describes your job role(s)?,Data analyst:Which of the following best describes your job role(s)?,Business analyst:Which of the following best describes your job role(s)?,Team lead:Which of the following best describes your job role(s)?,Product manager:Which of the following best describes your job role(s)?,CIO / CEO / CTO:Which of the following best describes your job role(s)?,Systems analyst:Which of the following best describes your job role(s)?,Other Write In::Which of the following best describes your job role(s)?,Could you tell us your age range?,What country do you live in?
1404,1405,Yes,,Java,JavaScript,,,,,,...,,,,,,,,,40–49,United States
1405,1406,Yes,,,JavaScript,,,,,,...,,,,,,,,,18–20,Nigeria
1406,1407,Yes,,,JavaScript,,,,,Bash / Shell,...,,,,,,,,,21–29,United States
1407,1408,Yes,,,,,,,,Bash / Shell,...,,,,,,CIO / CEO / CTO,,,50–59,United States
1408,1409,Yes,,,JavaScript,,,,,,...,,,,Team lead,,,,,40–49,United Kingdom


From the above we have successfully put all observations of python users into the `PYTHON_OBSERVATIONS_DF` constant and all non users into `NON_PYTHON_OBSERVATIONS_DF`


Now we are ready to start processing our questions


## Question 2

In this multi choice question, the users were asked:
> What other language(s) do you use?

The options of this question spans across mulitiple columns number 2 to 25. 


In [133]:
question_two = PYTHON_OBSERVATIONS_DF.iloc[:, 2:26]

In [134]:
question_two


Unnamed: 0,None:What other language(s) do you use?,Java:What other language(s) do you use?,JavaScript:What other language(s) do you use?,C/C++:What other language(s) do you use?,PHP:What other language(s) do you use?,C#:What other language(s) do you use?,Ruby:What other language(s) do you use?,Bash / Shell:What other language(s) do you use?,Objective-C:What other language(s) do you use?,Go:What other language(s) do you use?,...,R:What other language(s) do you use?,Swift:What other language(s) do you use?,Clojure:What other language(s) do you use?,Perl:What other language(s) do you use?,Rust:What other language(s) do you use?,Groovy:What other language(s) do you use?,TypeScript:What other language(s) do you use?,CoffeeScript:What other language(s) do you use?,HTML/CSS:What other language(s) do you use?,Other - Write In::What other language(s) do you use?
1404,,Java,JavaScript,,,,,,,,...,,,,,,,,,,
1405,,,JavaScript,,,,,,,,...,,,,,,,,,HTML/CSS,
1406,,,JavaScript,,,,,Bash / Shell,,,...,,,,,,,,,,
1407,,,,,,,,Bash / Shell,,,...,,,,,,,,,,Other - Write In:
1408,,,JavaScript,,,,,,,,...,,,Clojure,,Rust,,,,HTML/CSS,
1409,,,,,,,,Bash / Shell,,,...,R,,,,,,,,,
1410,,Java,JavaScript,C/C++,,,,Bash / Shell,,Go,...,R,,,,Rust,,,,,Other - Write In:
1411,,Java,,C/C++,,,,,,,...,,,,,,,,,,
1412,,,,C/C++,,,,,,,...,,,,,,,,,HTML/CSS,
1413,,,,,,,,Bash / Shell,,,...,,,,,,,,,,


In [123]:
melted_observations = pd.melt(question_two, var_name='option', value_name='reply')

# melted_observations = melted_observations[ melted_observations.reply.dropna()]
melted_observations  = melted_observations.dropna()

print(melted_observations.option.value_counts())
print('-----------')
print(melted_observations.reply.value_counts())



JavaScript:What other language(s) do you use?           9233
HTML/CSS:What other language(s) do you use?             8633
Bash / Shell:What other language(s) do you use?         8245
SQL:What other language(s) do you use?                  7451
C/C++:What other language(s) do you use?                5757
Java:What other language(s) do you use?                 4161
PHP:What other language(s) do you use?                  2385
C#:What other language(s) do you use?                   2077
Other - Write In::What other language(s) do you use?    1900
Go:What other language(s) do you use?                   1734
R:What other language(s) do you use?                    1589
TypeScript:What other language(s) do you use?           1385
None:What other language(s) do you use?                 1015
Ruby:What other language(s) do you use?                  800
Visual Basic:What other language(s) do you use?          757
Rust:What other language(s) do you use?                  741
Perl:What other language

In [124]:
lamda_func = lambda row: 'What other language(s) do you use?'

melted_observations.option = melted_observations.apply(lamda_func,axis=1)

melted_observations.reply = melted_observations.reply.astype('category')
melted_observations.option = melted_observations.option.astype('category')
melted_observations.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60739 entries, 26 to 442332
Data columns (total 2 columns):
option    60739 non-null category
reply     60739 non-null category
dtypes: category(2)
memory usage: 594.1 KB


In [125]:
# converting the option to the question title

melted_observations.head()

Unnamed: 0,option,reply
26,What other language(s) do you use?,
64,What other language(s) do you use?,
159,What other language(s) do you use?,
206,What other language(s) do you use?,
209,What other language(s) do you use?,


In [126]:
sheet_ready_question_two = melted_observations.pivot(columns='option' , values='reply')

In [127]:
sheet_ready_question_two.describe()
sheet_ready_question_two.iloc[:, 0] .value_counts()


JavaScript           9233
HTML/CSS             8633
Bash / Shell         8245
SQL                  7451
C/C++                5757
Java                 4161
PHP                  2385
C#                   2077
Other - Write In:    1900
Go                   1734
R                    1589
TypeScript           1385
None                 1015
Ruby                  800
Visual Basic          757
Rust                  741
Perl                  545
Kotlin                517
Scala                 509
Swift                 448
Groovy                307
Objective-C           254
CoffeeScript          151
Clojure               145
Name: What other language(s) do you use?, dtype: int64

We have successfully prepared the question two data for analysis and stored it in the `sheet_ready_question_two` variable. Next is to process question three


Next let us process the next question 

## Question 3 

> For what purposes do you mainly use Python?

The above is single choice question with the options
- For work
- For personal education
- Both for work and personal


In [140]:
question_three = PYTHON_OBSERVATIONS_DF.iloc[:, [26]]
question_three.head()


Unnamed: 0,For what purposes do you mainly use Python?
1404,Both for work and personal
1405,Both for work and personal
1406,For work
1407,Both for work and personal
1408,Both for work and personal


In [144]:
question_three.iloc[:, 0].value_counts()

Both for work and personal                     10987
For personal, educational  or side projects     3852
For work                                        3592
Name: For what purposes do you mainly use Python?, dtype: int64

The data is in the exact shape and structure as we want it. Thus we can say that this data is ready to be exported to the excel sheets


In [145]:
sheet_ready_question_three = question_three

### Question 4

This is a multi-choice question that has the same structure like question one. Each option is stored as a column. We would like to do the exact same cleaning as before. 

Looking at this right now, we see that there is a need to abstract the steps of cleanning the data into functions. Thus, making it easy to re-shape similar data

What we are going to do here, is to create functions that handle the extraction of multi-choice question



In [154]:
# Let us first get the dataframe for the questions

question_four_df = PYTHON_OBSERVATIONS_DF.iloc[:, 27:43]

question_four_df.head()


Unnamed: 0,Educational purposes:What do you use Python for?,Data analysis:What do you use Python for?,DevOps / System administration / Writing automation scripts:What do you use Python for?,Software testing / Writing automated tests:What do you use Python for?,Software prototyping:What do you use Python for?,Web development:What do you use Python for?,Machine learning:What do you use Python for?,Mobile development:What do you use Python for?,Desktop development:What do you use Python for?,Computer graphics:What do you use Python for?,Network programming:What do you use Python for?,Game development:What do you use Python for?,Multimedia applications development:What do you use Python for?,Embedded development:What do you use Python for?,Programming of web parsers / scrapers / crawlers:What do you use Python for?,Other - Write In::What do you use Python for?
1404,,Data analysis,DevOps / System administration / Writing autom...,Software testing / Writing automated tests,,,Machine learning,,,,,,,,,
1405,,Data analysis,,,,Web development,Machine learning,,,,,,,,,
1406,Educational purposes,,DevOps / System administration / Writing autom...,,,,,,,,Network programming,,,,,
1407,,Data analysis,DevOps / System administration / Writing autom...,Software testing / Writing automated tests,Software prototyping,Web development,,,Desktop development,,Network programming,,,,Programming of web parsers / scrapers / crawlers,
1408,,Data analysis,DevOps / System administration / Writing autom...,Software testing / Writing automated tests,,Web development,Machine learning,,,,Network programming,,,,Programming of web parsers / scrapers / crawlers,


In [313]:
def reshape_data(dataframe, question_asked, var_name='question', value_name='answer_chosen', fillna_with_str='', dropna=True ):
    melted_observations = pd.melt(dataframe, var_name=var_name, value_name=value_name)

    # melted_observations = melted_observations[ melted_observations.reply.dropna()]
    if dropna:
        melted_observations  = melted_observations.dropna()
    else:
        melted_observations  = melted_observations.fillna(fillna_with_str)

    lamda_func = lambda row: question_asked

    melted_observations[var_name] = melted_observations.apply(lamda_func,axis=1)
    
    pivoted_data = melted_observations.pivot(columns=var_name , values=value_name)
    return pivoted_data
    
    

In [314]:
sheet_ready_question_four = reshape_data(question_four_df, question_asked='What do you use Python for?')
sheet_ready_question_four.head()


question,What do you use Python for?
2,Educational purposes
5,Educational purposes
6,Educational purposes
7,Educational purposes
8,Educational purposes


In [185]:
sheet_ready_question_four.iloc[:, 0].value_counts(dropna=False)

Data analysis                                                  10615
Web development                                                 9656
DevOps / System administration / Writing automation scripts     7853
Machine learning                                                7074
Programming of web parsers / scrapers / crawlers                6799
Software testing / Writing automated tests                      5859
Educational purposes                                            5108
Software prototyping                                            4955
Network programming                                             3755
Desktop development                                             3442
Computer graphics                                               1694
Embedded development                                            1482
Other - Write In:                                               1312
Game development                                                1186
Mobile development                

In [164]:
sheet_ready_question_four.iloc[:, 0].describe()

count             72265
unique               16
top       Data analysis
freq              10615
Name: What do you use Python for?, dtype: object

In [167]:
sheet_ready_question_four.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72265 entries, 2 to 294882
Data columns (total 1 columns):
What do you use Python for?    72265 non-null object
dtypes: object(1)
memory usage: 1.1+ MB




#### Pipped question from question 4


The following questions(**question 5 and 6**) were piped for question 4: 

**Question 5**
> To what extent are you involved in the following activities?

The answer to this question has the following properties:
- all the option from `question 4`*(what do you use python for?)* are repeated as columns and the values are either primary activity, secondary activity or hobby
- the options the user answered from `question 4` are all contained in this question

**Question 6**
> What do you use Python for most?

This value of this question is `Nan` if the user selected only one answer from `question 4`. Thus, we can assume that the answer to this question when the value is `Nan` is the one answer the user selected

**Question 7**
> Do you consider yourself as a Data-Scientist?

This question is displayed if the user selects "Data analysis" or "Machine learning" as the answer to question_four. Thus, we can replace the `Nan` with `I do neither data science nor machine learning`



In [244]:
question_five = PYTHON_OBSERVATIONS_DF.iloc[:, 43:59]
question_five.head()

Unnamed: 0,Educational purposes:To what extent are you involvedin the following activities?,Data analysis:To what extent are you involvedin the following activities?,DevOps / System administration / Writing automation scripts:To what extent are you involvedin the following activities?,Software testing / Writing automated tests:To what extent are you involvedin the following activities?,Software prototyping:To what extent are you involvedin the following activities?,Web development:To what extent are you involvedin the following activities?,Machine learning:To what extent are you involvedin the following activities?,Mobile development:To what extent are you involvedin the following activities?,Desktop development:To what extent are you involvedin the following activities?,Computer graphics:To what extent are you involvedin the following activities?,Network programming:To what extent are you involvedin the following activities?,Game development:To what extent are you involvedin the following activities?,Multimedia applications development:To what extent are you involvedin the following activities?,Embedded development:To what extent are you involvedin the following activities?,Programming of web parsers / scrapers / crawlers:To what extent are you involvedin the following activities?,Other - Write In::To what extent are you involvedin the following activities?
1404,,primary activity,secondary activity,secondary activity,,,hobby,,,,,,,,,
1405,,primary activity,,,,primary activity,primary activity,,,,,,,,,
1406,hobby,,primary activity,,,,,,,,primary activity,,,,,
1407,,secondary activity,primary activity,secondary activity,hobby,secondary activity,,,secondary activity,,primary activity,,,,secondary activity,
1408,,primary activity,secondary activity,secondary activity,,primary activity,secondary activity,,,,secondary activity,,,,secondary activity,


In [175]:
sheet_ready_question_five = reshape_data(question_five, question_asked='To what extent are you involvedin the following activities?')
sheet_ready_question_five.head()

question,To what extent are you involvedin the following activities?
2,hobby
5,primary activity
6,primary activity
7,primary activity
8,secondary activity


In [176]:
sheet_ready_question_four.shape

(72265, 1)

In [177]:
sheet_ready_question_five.shape

(72265, 1)

In [258]:
sheet_ready_question_five = pd.merge(sheet_ready_question_four, sheet_ready_question_five,left_index=True, right_index=True)

In [260]:
print(sheet_ready_question_five.shape)
sheet_ready_question_five.head()

(72265, 2)


question,What do you use Python for?,To what extent are you involvedin the following activities?
2,Educational purposes,hobby
5,Educational purposes,primary activity
6,Educational purposes,primary activity
7,Educational purposes,primary activity
8,Educational purposes,secondary activity


In [233]:
question_six = PYTHON_OBSERVATIONS_DF.iloc[:, [59]]
question_six.head()
print(question_six.shape)
question_six = pd.merge(question_four_df, question_six, left_index=True, right_index=True)

(18431, 1)


In [234]:
print(question_six.iloc[:, -1].value_counts(dropna=False))
print(question_six.shape)
question_six.iloc[:, [-1]].head()

Web development                                                4364
Data analysis                                                  2800
NaN                                                            2184
Machine learning                                               1785
DevOps / System administration / Writing automation scripts    1738
Educational purposes                                           1191
Other - Write In:                                               872
Desktop development                                             685
Software prototyping                                            630
Programming of web parsers / scrapers / crawlers                595
Software testing / Writing automated tests                      560
Network programming                                             457
Embedded development                                            198
Game development                                                158
Computer graphics                               

Unnamed: 0,What do you use Python for the most?
1404,Data analysis
1405,Web development
1406,DevOps / System administration / Writing autom...
1407,Software prototyping
1408,Web development


In [241]:
def change_all_nans_to_main_use(row):
    """Replaces the nan values
    
    Replaces the nan values in the column 'What do you use Python for the most?'
    with the answer to the question 'What do you use Python for?'
    
    """
    users_most_python_use_case = row[-1]
    if isinstance(users_most_python_use_case, float) and  math.isnan(users_most_python_use_case) :

        primary_use_case = next(item for item in row.values if isinstance(item, str))
        row[-1]= primary_use_case
        
    return row

question_six_with_no_nans  = question_six.apply(change_all_nans_to_main_use,axis=1)
question_six_with_no_nans.iloc[:, -1].value_counts(dropna=False)

question_six_with_no_nans = question_six_with_no_nans.iloc[:,[ -1]]
question_six_with_no_nans.head()

Unnamed: 0,What do you use Python for the most?
1404,Data analysis
1405,Web development
1406,DevOps / System administration / Writing autom...
1407,Software prototyping
1408,Web development


In [242]:

question_six_with_no_nans.iloc[:,0].value_counts(dropna=False)

Web development                                                5072
Data analysis                                                  3136
DevOps / System administration / Writing automation scripts    1950
Machine learning                                               1923
Educational purposes                                           1496
Other - Write In:                                               876
Desktop development                                             780
Software prototyping                                            650
Software testing / Writing automated tests                      650
Programming of web parsers / scrapers / crawlers                637
Network programming                                             501
Embedded development                                            233
Game development                                                211
Computer graphics                                               169
Multimedia applications development             

In [None]:
sheet_ready_question_six = question_six_with_no_nans


In [254]:
question_seven = PYTHON_OBSERVATIONS_DF.iloc[:, [60]]
question_seven.head()

Unnamed: 0,Do you consider yourself as a Data-Scientist?
1404,No
1405,No
1406,
1407,No
1408,No


In [255]:
question_seven.iloc[:, 0].value_counts(dropna=False)

No                   7509
NaN                  6582
Yes                  3606
Other – Write In:     734
Name: Do you consider yourself as a Data-Scientist?, dtype: int64

In [257]:
sheet_ready_question_seven = \
    question_seven.fillna('I do neither data science nor machine learning')
sheet_ready_question_seven.iloc[:, 0].value_counts(dropna=False)

No                                                7509
I do neither data science nor machine learning    6582
Yes                                               3606
Other – Write In:                                  734
Name: Do you consider yourself as a Data-Scientist?, dtype: int64

#### Question 8

> Which version of Python do you use the most?*

There are only two possible answers to this question namely:
- Python 2
- Python 3



In [262]:
# Extracting the answers to question 8 and storing it in the variable question_eight
question_eight = PYTHON_OBSERVATIONS_DF.iloc[:, [61]]
print(question_eight.iloc[:, 0].value_counts(dropna=False))
question_eight.head()

Python 3    15446
Python 2     2985
Name: Which version of Python do you use the most?, dtype: int64


Unnamed: 0,Which version of Python do you use the most?
1404,Python 3
1405,Python 3
1406,Python 3
1407,Python 3
1408,Python 3


The data above is as clean as can be as there are no Nan values

In [264]:
# Since the data is sheet ready we store it in sheet_ready_question_eight variable
sheet_ready_question_eight = question_eight
sheet_ready_question_eight.shape

(18431, 1)

#### Pipped question from question 8

The next two questions are dependent on the answer of question 8

**Quesiton 9**
>Which version of Python 2 do you use the most?*

The above question is displayed when the user selects `python 2` in  question 8

**Quesiton 10**
>Which version of Python 3 do you use the most?*

The above question is displayed when the user selects `python 3` in  question 8


Nan logic
---
 
When the question is not displayed to the user,  then `Nan` is put as the value. I think we can just dropna for both questions


In [265]:
# Extracting the answers to question 9 and storing it in the variable question_nine
question_nine = PYTHON_OBSERVATIONS_DF.iloc[:, [62]]
print(question_nine.iloc[:, 0].value_counts(dropna=False))
question_nine.head()

NaN           15605
Python 2.7     2630
Python 2.6       65
Python 2.0       46
Python 2.4       20
Python 2.1       18
Python 2.5       17
Python 2.3       16
Python 2.2       14
Name: Which version of Python 2 do you use the most?, dtype: int64


Unnamed: 0,Which version of Python 2 do you use the most?
1404,
1405,
1406,
1407,
1408,


In [267]:
# Extracting the answers to question 10 and storing it in the variable question_ten
question_ten = PYTHON_OBSERVATIONS_DF.iloc[:, [63]]
print(question_ten.iloc[:, 0].value_counts(dropna=False))
question_ten.head()

Python 3.6    7812
Python 3.7    4277
NaN           4089
Python 3.5    1550
Python 3.4     394
Python 3.0      94
Python 3.2      75
Python 3.1      72
Python 3.3      68
Name: Which version of Python 3 do you use the most?, dtype: int64


Unnamed: 0,Which version of Python 3 do you use the most?
1404,Python 3.6
1405,Python 3.7
1406,Python 3.6
1407,Python 3.6
1408,Python 3.5


In [271]:
sheet_ready_question_nine = question_nine.dropna()
sheet_ready_question_ten = question_ten.dropna()

print(sheet_ready_question_nine.iloc[:, 0].value_counts())
print('-----')
print(sheet_ready_question_ten.iloc[:, 0].value_counts())

Python 2.7    2630
Python 2.6      65
Python 2.0      46
Python 2.4      20
Python 2.1      18
Python 2.5      17
Python 2.3      16
Python 2.2      14
Name: Which version of Python 2 do you use the most?, dtype: int64
-----
Python 3.6    7812
Python 3.7    4277
Python 3.5    1550
Python 3.4     394
Python 3.0      94
Python 3.2      75
Python 3.1      72
Python 3.3      68
Name: Which version of Python 3 do you use the most?, dtype: int64


#### Question 11

> What do you typically use to upgrade your Python version?

The above is a multi-choice question whose options are stored in column index 64 to 77. We would process this using the `reshape_data` function that reshapes multi-choice questions



In [277]:
question_eleven = PYTHON_OBSERVATIONS_DF.iloc[:, 64:78]
print(question_eleven.shape)
question_eleven.head()

(18431, 14)


Unnamed: 0,I dont update:What do you typically use to upgrade your Python version?,Somebody else manages Python updates for me:What do you typically use to upgrade your Python version?,Python.org:What do you typically use to upgrade your Python version?,Build from source:What do you typically use to upgrade your Python version?,Automatic upgrade via cloud provider:What do you typically use to upgrade your Python version?,Enthought:What do you typically use to upgrade your Python version?,Anaconda:What do you typically use to upgrade your Python version?,ActivePython:What do you typically use to upgrade your Python version?,Intel Distribution for Python:What do you typically use to upgrade your Python version?,"OS-provided Python (via apt-get, yum, homebrew, etc.):What do you typically use to upgrade your Python version?",pyenv:What do you typically use to upgrade your Python version?,pythonz:What do you typically use to upgrade your Python version?,I use Docker containers:What do you typically use to upgrade your Python version?,Other Write In::What do you typically use to upgrade your Python version?
1404,,,,,,,Anaconda,,,,,,,
1405,,,,,,,Anaconda,,Intel Distribution for Python,,,,,
1406,,,Python.org,,,,,,,,,,,
1407,,,Python.org,,,,,,,"OS-provided Python (via apt-get, yum, homebrew...",,,,
1408,,,,,,,,,,"OS-provided Python (via apt-get, yum, homebrew...",,,I use Docker containers,


In [280]:

sheet_ready_question_eleven = \
reshape_data(question_eleven, question_asked='What do you typically use to upgrade your Python version?')

print(sheet_ready_question_eleven.shape)
sheet_ready_question_eleven.iloc[:, 0].value_counts()


(27697, 1)


OS-provided Python (via apt-get, yum, homebrew, etc.)    6912
Python.org                                               6128
Anaconda                                                 4054
I use Docker containers                                  2788
pyenv                                                    2728
I don’t update                                           1539
Build from source                                        1297
Somebody else manages Python updates for me               854
Other – Write In:                                         606
Automatic upgrade via cloud provider                      348
Intel Distribution for Python                             171
ActivePython                                              151
pythonz                                                    69
Enthought                                                  52
Name: What do you typically use to upgrade your Python version?, dtype: int64

#### Question 12

> Do you use any of the following tools to isolate Python environments, if any?

The above is a multi-choice question whose options are stored in column index 78 to 83. We do exactly what we did for question 11




In [367]:
question_twelve = PYTHON_OBSERVATIONS_DF.iloc[:, 78:84]
print(question_twelve.shape)
question_twelve.head()

(18431, 6)


Unnamed: 0,"None:Do you use any of the following tools to isolate Python environments, if any?","Virtualenv / pipenv:Do you use any of the following tools to isolate Python environments, if any?","Conda:Do you use any of the following tools to isolate Python environments, if any?","Docker:Do you use any of the following tools to isolate Python environments, if any?","Vagrant / virtual machines:Do you use any of the following tools to isolate Python environments, if any?","Other Write In::Do you use any of the following tools to isolate Python environments, if any?"
1404,,,,,,
1405,,,Conda,,,
1406,,,,,,
1407,,Virtualenv / pipenv,,,,
1408,,Virtualenv / pipenv,,Docker,,


In [368]:
question_12_str = 'Do you use any of the following tools to isolate Python environments, if any?'
sheet_ready_question_twelve = \
reshape_data(question_twelve, question_asked=question_12_str)

print(sheet_ready_question_twelve.shape)
sheet_ready_question_twelve.iloc[:, 0].value_counts()

(27636, 1)


Virtualenv / pipenv            11729
Docker                          5787
Conda                           4084
None                            3800
Vagrant  / virtual machines     1676
Other – Write In:                560
Name: Do you use any of the following tools to isolate Python environments, if any?, dtype: int64

#### Question 13

> What web frameworks / libraries do you use in addition to Python?

Multi question with columns spanning index 83 to 95. Similar processing 




In [369]:
# Extracting the data
question_13_str = 'What web frameworks / libraries do you use in addition to Python?'

question_thirteen = PYTHON_OBSERVATIONS_DF.iloc[:, 84:96]
print(question_thirteen.shape)
question_thirteen.head()


(18431, 12)


Unnamed: 0,None:What web frameworks / libraries do you use in addition to Python?,Django:What web frameworks / libraries do you use in addition to Python?,TurboGears:What web frameworks / libraries do you use in addition to Python?,web2py:What web frameworks / libraries do you use in addition to Python?,Bottle:What web frameworks / libraries do you use in addition to Python?,CherryPy:What web frameworks / libraries do you use in addition to Python?,Flask:What web frameworks / libraries do you use in addition to Python?,Hug:What web frameworks / libraries do you use in addition to Python?,Pyramid:What web frameworks / libraries do you use in addition to Python?,Tornado:What web frameworks / libraries do you use in addition to Python?,Falcon:What web frameworks / libraries do you use in addition to Python?,Other Write In::What web frameworks / libraries do you use in addition to Python?
1404,,,,,,,Flask,,,,,
1405,,Django,,,,,,,,,,
1406,,,,,,,,,,,,
1407,,Django,,,,,Flask,,,,,
1408,,Django,,,,,Flask,,,Tornado,,


In [296]:
sheet_ready_question_thirteen = \
reshape_data(question_thirteen, question_asked=question_13_str)

print(sheet_ready_question_thirteen.shape)
sheet_ready_question_thirteen.iloc[:, 0].value_counts(dropna=False)


(54890, 1)


Virtualenv / pipenv            11729
Flask                           8615
None                            8473
Django                          8354
Docker                          5787
Conda                           4084
Other – Write In:               1732
Vagrant  / virtual machines     1676
Tornado                         1080
Pyramid                          720
web2py                           693
Bottle                           684
CherryPy                         543
Falcon                           456
Hug                              159
TurboGears                       105
Name: What web frameworks / libraries do you use in addition to Python?, dtype: int64

#### Question 14

> What data science framework(s) do you use in addition to Python?

Multi question with columns spanning index 96 to 108. Similar processing 


In [299]:
question_14_str = 'What data science framework(s) do you use in addition to Python?'
question_fourteen = PYTHON_OBSERVATIONS_DF.iloc[:, 96:109]
print(question_fourteen.shape)
question_fourteen.head()

(18431, 13)


Unnamed: 0,None:What data science framework(s) do you use in addition to Python?,NumPy:What data science framework(s) do you use in addition to Python?,SciPy:What data science framework(s) do you use in addition to Python?,Pandas:What data science framework(s) do you use in addition to Python?,Matplotlib:What data science framework(s) do you use in addition to Python?,Seaborn:What data science framework(s) do you use in addition to Python?,SciKit-Learn:What data science framework(s) do you use in addition to Python?,Keras:What data science framework(s) do you use in addition to Python?,TensorFlow:What data science framework(s) do you use in addition to Python?,Theano:What data science framework(s) do you use in addition to Python?,NLTK:What data science framework(s) do you use in addition to Python?,Gensim:What data science framework(s) do you use in addition to Python?,Other - Write In::What data science framework(s) do you use in addition to Python?
1404,,NumPy,,Pandas,Matplotlib,,,,,,,,
1405,,,,Pandas,Matplotlib,,,Keras,TensorFlow,,,,
1406,,,,,,,,,,,,,
1407,,NumPy,SciPy,,,,,,,,,,
1408,,NumPy,,Pandas,,,SciKit-Learn,,,,,,


In [302]:
sheet_ready_question_fourteen = \
reshape_data(question_fourteen, question_asked=question_14_str)

print(sheet_ready_question_fourteen.shape)
sheet_ready_question_fourteen.iloc[:, 0].value_counts(dropna=False)


(61707, 1)


NumPy                11384
Pandas                9314
Matplotlib            8434
SciPy                 6989
SciKit-Learn          5787
None                  5018
TensorFlow            4653
Keras                 2776
Seaborn               2713
NLTK                  2413
Other - Write In:      909
Gensim                 760
Theano                 557
Name: What data science framework(s) do you use in addition to Python?, dtype: int64

The last few processing seemed similar. Although most of the work was done in the function `reshape_data`, it turns out that there are some repetitions.

I would like to spend some time creating functions that automate these repetitions


In [307]:

def reshape_and_show_info(question_asked, start_index, end_index):
    untidy_question_df = PYTHON_OBSERVATIONS_DF.iloc[:, start_index:end_index]
    print('---Initial Shape---')
    print(question_fourteen.shape)
    reshaped_df = \
    reshape_data(untidy_question_df, question_asked=question_asked)
    
    print('---Shape after reshaping---')
    print(reshaped_df.shape)
    print(reshaped_df.iloc[:, 0].value_counts(dropna=False))
    return reshaped_df



Subsequently, I would simply cann the function `reshape_and_show_info` for multi-choice questions

#### Question 15

> Which of the following frameworks / libraries do you use in addition to Python?

Multi question with columns spanning index 109 to 123. Would be processed using the `reshape_and_show_info `



In [310]:
question_15_str = 'Which of the following frameworks / libraries do you use in addition to Python?'

sheet_ready_question_fifteen = reshape_and_show_info( question_15_str, 109, 124)
sheet_ready_question_fifteen

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(43581, 1)
Requests             9834
Pillow               5430
Scrapy               3462
Asyncio              3311
Tkinter              3303
None                 3285
PyQT                 2985
Six                  2620
aiohttp              1948
Pygame               1943
Other – Write In:    1364
wxPython             1131
Twisted              1125
Kivy                 1043
PyGTK                 797
Name: Which of the following frameworks / libraries do you use in addition to Python?, dtype: int64


question,Which of the following frameworks / libraries do you use in addition to Python?
0,
12,
20,
25,
29,
44,
45,
50,
67,
93,


#### Question 16

> Which of the following cloud platforms do you use?

Multi question with columns spanning index 124 to 135. Similar processing 




In [330]:
question_str ='Which of the following cloud platforms do you use?'
sheet_ready_question_sixteen = reshape_and_show_info( question_str, 124, 136)
sheet_ready_question_sixteen

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(28938, 1)
AWS                      6679
None                     6354
Google Cloud Platform    3623
DigitalOcean             3133
Heroku                   2603
Microsoft Azure          1845
PythonAnywhere           1561
Other – Write In:        1008
OpenStack                 812
Linode                    723
OpenShift                 396
Rackspace                 201
Name: Which of the following cloud platforms do you use?, dtype: int64


question,Which of the following cloud platforms do you use?
2,
4,
7,
8,
12,
15,
23,
25,
26,
27,


#### Dependent questions

The next two questions are dependent on the answer of question 16

**Quesiton 17**
>How do you run code in the cloud (in the production environment)?

Spans colums 137 to 142

**Quesiton 18**
>How do you develop for the cloud?

Spans colums 143 to 151

They are shown only when the user did not choose `None` in question 16. Thus, Nan indicates that:
- the user does not use any cloud platform



Nan logic
---
 
When the question is not displayed to the user,  then `Nan` is put as the value, we replace it with `I dont use cloud computing`


Let us spend some time automating the processing for both questions in a function



In [336]:
def reshape_by_replacing_na_and_show_info(question_asked, start_index, end_index, na_str, ):
    untidy_question_df = PYTHON_OBSERVATIONS_DF.iloc[:, start_index:end_index]
    print('---Initial Shape---')
    print(untidy_question_df.shape)

    reshaped_df = \
        reshape_data(untidy_question_df, question_asked=question_asked,fillna_with_str=na_str,dropna=False)
    
    print('---Shape after reshaping---')
    print(reshaped_df.shape)
    print(reshaped_df.iloc[:, 0].value_counts(dropna=False))
    return reshaped_df



In [342]:
# processing question 17
question_str = 'How do you run code in the cloud (in the production environment)?'
na_str= 'I dont use cloud computing'
sheet_ready_question_seventeen =  \
reshape_by_replacing_na_and_show_info(question_str, 136, 142, na_str)
sheet_ready_question_seventeen.head(10)

---Initial Shape---
(18431, 6)
---Shape after reshaping---
(110586, 1)
I dont use cloud computing                                          93649
In virtual machines                                                  5308
Within containers                                                    4461
On a Platform-as-a-Service (such as Heroku or Google App Engine)     3084
Serverless (such as AWS Lambda or Cloud Functions)                   2353
None of the following                                                1451
Other – Write In:                                                     280
Name: How do you run code in the cloud (in the production environment)?, dtype: int64


question,How do you run code in the cloud (in the production environment)?
0,I dont use cloud computing
1,I dont use cloud computing
2,I dont use cloud computing
3,I dont use cloud computing
4,I dont use cloud computing
5,I dont use cloud computing
6,I dont use cloud computing
7,I dont use cloud computing
8,I dont use cloud computing
9,I dont use cloud computing


In [340]:
# processing question 18
question_str = 'How do you develop for the cloud?'
na_str= 'I dont use cloud computing'
sheet_ready_question_eighteen =  \
reshape_by_replacing_na_and_show_info(question_str, 142, 150, na_str)
sheet_ready_question_eighteen.head(10)

---Initial Shape---
(18431, 8)
---Shape after reshaping---
(147448, 1)
I dont use cloud computing                128525
Locally with virtualenv (or similar)        6317
In Docker containers                        3940
In virtual machines                         2728
With local system interpreter               1959
In remote development environments          1855
None of the following                       1042
Directly in the production environment       968
Other – Write In:                            114
Name: How do you develop for the cloud?, dtype: int64


question,How do you develop for the cloud?
0,I dont use cloud computing
1,I dont use cloud computing
2,I dont use cloud computing
3,I dont use cloud computing
4,I dont use cloud computing
5,I dont use cloud computing
6,I dont use cloud computing
7,I dont use cloud computing
8,I dont use cloud computing
9,I dont use cloud computing


#### Question 19

> What operating system(s) are your development environment?

Multi question with columns spanning index 151 to 155. Similar processing




In [343]:
question_nineteen = 'What operating system(s) are your development environment?'

sheet_ready_question_nineteen = reshape_and_show_info( question_15_str, 150, 155)
sheet_ready_question_nineteen.head(10)

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(27753, 1)
Linux                12780
Windows               8703
macOS                 5806
BSD                    264
Other – Write In:      200
Name: Which of the following frameworks / libraries do you use in addition to Python?, dtype: int64


question,Which of the following frameworks / libraries do you use in addition to Python?
3,Windows
5,Windows
11,Windows
14,Windows
19,Windows
23,Windows
26,Windows
28,Windows
32,Windows
36,Windows


#### Question 20

> Which Python unit-testing framework(s) do you use, if any?

The patterns are now looking the same. From now onwards I would just stop putting the questions in markdowns an would just go straight to extracting the questions



In [345]:
# Processing Question 20
question_str ='Which Python unit-testing framework(s) do you use, if any?'

sheet_ready_question_twenty = reshape_and_show_info( question_15_str, 155, 164)
sheet_ready_question_twenty.head(10)

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(28787, 1)
pytest               8448
None                 6526
unittest             5873
mock                 2848
Tox                  1590
nose                 1576
doctest               992
Hypothesis            659
Other – Write In:     275
Name: Which of the following frameworks / libraries do you use in addition to Python?, dtype: int64


question,Which of the following frameworks / libraries do you use in addition to Python?
7,
20,
21,
24,
28,
29,
38,
40,
44,
48,


In [346]:
# Processing muti-chioce Question 21
question_str ='What ORM(s) do you use together with Python, if any?'

sheet_ready_question_twenty_one = reshape_and_show_info( question_15_str,  164, 173)
sheet_ready_question_twenty_one.head(10)

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(22526, 1)
None                 7941
SQLAlchemy           6268
Django ORM           6010
SQLObject             771
Peewee                663
Other – Write In:     391
PonyORM               212
Tortoise ORM          148
Dejavu                122
Name: Which of the following frameworks / libraries do you use in addition to Python?, dtype: int64


question,Which of the following frameworks / libraries do you use in addition to Python?
0,
2,
7,
8,
12,
13,
19,
25,
26,
27,


In [347]:
# Processing muti-chioce Question 22
question_str ='Which database(s) do you regularly use, if any?'

sheet_ready_question_twenty_two = reshape_and_show_info( question_15_str,  173, 189)
sheet_ready_question_twenty_two.head(10)

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(39970, 1)
PostgreSQL           8133
MySQL                7562
SQLite               7530
MongoDB              3774
Redis                3488
None                 2779
MS SQL Server        1992
Oracle Database      1396
Other – Write In:    1144
Amazon Redshift       555
Cassandra             480
Neo4j                 388
DB2                   249
HBase                 216
Couchbase             183
h2                    101
Name: Which of the following frameworks / libraries do you use in addition to Python?, dtype: int64


question,Which of the following frameworks / libraries do you use in addition to Python?
7,
8,
25,
26,
27,
48,
49,
54,
55,
67,


In [348]:
# Processing muti-chioce Question 23
question_str ='Which of the following Big Data tool(s) do you use, if any?'

sheet_ready_question_twenty_three = reshape_and_show_info( question_15_str,  189, 201)
sheet_ready_question_twenty_three.head(10)

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(21722, 1)
None                       13954
Apache Spark                2263
Apache Hadoop/MapReduce     1452
Apache Kafka                1391
Apache Hive                  827
Dask                         760
Other – Write In:            262
Apache Beam                  198
ClickHouse                   195
Apache Flink                 160
Apache Tez                   158
Apache Samza                 102
Name: Which of the following frameworks / libraries do you use in addition to Python?, dtype: int64


question,Which of the following frameworks / libraries do you use in addition to Python?
1,
2,
3,
7,
8,
10,
11,
12,
13,
14,


In [357]:
# Processing muti-chioce Question 24
question_str = 'Which Continuous Integration (CI) system(s) do you regularly use?'


sheet_ready_question_twenty_four = reshape_and_show_info( question_str, 201, 211)
sheet_ready_question_twenty_four.head(10)

---Initial Shape---
(18431, 13)
---Shape after reshaping---
(23617, 1)
None                 8415
Jenkins / Hudson     4644
Gitlab CI            3407
Travis CI            3325
CircleCI             1319
Other – Write In:     868
AppVeyor              591
TeamCity              479
Bamboo                468
CruiseControl         101
Name: Which Continuous Integration (CI) system(s) do you regularly use?, dtype: int64


question,Which Continuous Integration (CI) system(s) do you regularly use?
1,
7,
8,
11,
17,
20,
21,
24,
28,
29,


In [358]:
# Processing muti-chioce Question 25
question_str = 'Which configuration management tools do you use, if any?'


sheet_ready_question_twenty_five = reshape_and_show_info( question_str, 211, 218)
sheet_ready_question_twenty_five.head(10)


---Initial Shape---
(18431, 13)
---Shape after reshaping---
(20217, 1)
None                 11921
Ansible               3621
Custom solution       1627
Puppet                1033
Salt                   825
Chef                   669
Other – Write In:      521
Name: Which configuration management tools do you use, if any?, dtype: int64


question,"Which configuration management tools do you use, if any?"
1,
2,
5,
7,
8,
9,
12,
14,
16,
18,


In [359]:
# Processing muti-chioce Question 26
question_str = 'What is the main editor you use for your current Python development?'
# 

sheet_ready_question_twenty_six = reshape_and_show_info( question_str, 218, 219)
sheet_ready_question_twenty_six.head(10)


---Initial Shape---
(18431, 13)
---Shape after reshaping---
(18431, 1)
PyCharm Professional Edition             3581
VS Code                                  2952
PyCharm Community Edition                2641
Vim                                      1831
Sublime Text                             1545
Jupyter Notebook                          976
Atom                                      974
IDLE                                      664
Emacs                                     564
Spyder                                    463
Other – Write In:                         450
IntelliJ IDEA                             373
NotePad++                                 343
Eclipse + Pydev                           312
None                                      309
Python Tools for Visual Studio (PTVS)     128
Wing IDE                                   82
Gedit                                      80
NetBeans                                   45
Komodo Editor                              28
TextMate 

question,What is the main editor you use for your current Python development?
0,NotePad++
1,VS Code
2,PyCharm Community Edition
3,Emacs
4,Wing IDE
5,Atom
6,PyCharm Professional Edition
7,PyCharm Professional Edition
8,Other – Write In:
9,PyCharm Professional Edition


In [360]:
# Processing muti-chioce Question 27
question_str = \
'What editors/IDEs do you use for Python development in addition to your main ide/editor?'


sheet_ready_question_twenty_seven = reshape_and_show_info( question_str, 219, 245)
sheet_ready_question_twenty_seven.head(10)



---Initial Shape---
(18431, 13)
---Shape after reshaping---
(34821, 1)
Vim                                      4581
Jupyter Notebook                         3881
Sublime Text                             3460
VS Code                                  3341
PyCharm Community Edition                2864
None                                     2784
NotePad++                                2745
Atom                                     2188
IDLE                                     1676
PyCharm Professional Edition             1345
Spyder                                   1044
Other – Write In:                         876
Gedit                                     789
IntelliJ IDEA                             680
Python Tools for Visual Studio (PTVS)     643
Emacs                                     564
Eclipse + Pydev                           536
NetBeans                                  262
TextMate                                  148
Wing IDE                                   91
Ninja-IDE

question,What editors/IDEs do you use for Python development in addition to your main ide/editor?
18471,PyCharm Professional Edition
18474,PyCharm Professional Edition
18518,PyCharm Professional Edition
18550,PyCharm Professional Edition
18554,PyCharm Professional Edition
18574,PyCharm Professional Edition
18588,PyCharm Professional Edition
18599,PyCharm Professional Edition
18611,PyCharm Professional Edition
18616,PyCharm Professional Edition


In [361]:
# Processing muti-chioce Question 28
question_str = \
'Do you regularly work on multiple projects at the same time?'


sheet_ready_question_twenty_eight = reshape_and_show_info( question_str, 245, 246)
sheet_ready_question_twenty_eight.head(10)



---Initial Shape---
(18431, 13)
---Shape after reshaping---
(18431, 1)
Yes, I work on many different projects               7756
Yes, I work on one main and several side projects    7611
No,  I only work on one project                      3064
Name: Do you regularly work on multiple projects at the same time?, dtype: int64


question,Do you regularly work on multiple projects at the same time?
0,"Yes, I work on one main and several side projects"
1,"Yes, I work on many different projects"
2,"Yes, I work on one main and several side projects"
3,"Yes, I work on many different projects"
4,"Yes, I work on many different projects"
5,"Yes, I work on many different projects"
6,"Yes, I work on many different projects"
7,"Yes, I work on many different projects"
8,"No, I only work on one project"
9,"Yes, I work on one main and several side projects"


In [362]:
# Processing muti-chioce Question 29

question_str = 'When developing in Python, how often do'


sheet_ready_question_twenty_nine = reshape_and_show_info( question_str,246, 261)
sheet_ready_question_twenty_nine.head(10)


---Initial Shape---
(18431, 13)
---Shape after reshaping---
(206902, 1)
Often                         83423
Never or<br />Almost never    63155
From time<br />to time        60324
Name: When developing in Python, how often do, dtype: int64


question,"When developing in Python, how often do"
0,Often
3,Never or<br />Almost never
4,Often
8,Often
10,Often
12,Never or<br />Almost never
13,From time<br />to time
15,Often
16,Often
17,Often


In [372]:
PYTHON_OBSERVATIONS_DF.iloc[:, 261].value_counts(dropna=False)

Friend / Colleague                 4828
Search engines                     2708
School / University                2575
I don't remember                   2467
Technical review / Forum / Blog    2273
Social network                     1259
Other – Write In:                  1134
Conference / User Group             525
NaN                                 353
Advertising                         309
Name: How did you first learn aboutyour main ide/editor?, dtype: int64

In [376]:
PYTHON_OBSERVATIONS_DF.iloc[:, 218].value_counts(dropna=False)

PyCharm Professional Edition             3581
VS Code                                  2952
PyCharm Community Edition                2641
Vim                                      1831
Sublime Text                             1545
Jupyter Notebook                          976
Atom                                      974
IDLE                                      664
Emacs                                     564
Spyder                                    463
Other – Write In:                         450
IntelliJ IDEA                             373
NotePad++                                 343
Eclipse + Pydev                           312
None                                      309
Python Tools for Visual Studio (PTVS)     128
Wing IDE                                   82
Gedit                                      80
NetBeans                                   45
Komodo Editor                              28
TextMate                                   23
Komodo IDE                        

In [398]:
obs = PYTHON_OBSERVATIONS_DF.iloc[:, [218,261]]
obs[ obs.iloc[:, 0] == 'None' ]

Unnamed: 0,What is the main editor you use for your current Python development?,How did you first learn aboutyour main ide/editor?
2468,,
2656,,
3201,,
3302,,
3466,,
3794,,
3835,,
3920,,
3938,,
4011,,


In [411]:
obs[pd.isna( obs.iloc[:, -1] )].iloc[:, 0].value_counts(dropna=False)
# obs[ isinstance(obs.iloc[:, -1], float)].iloc[:, -1].value_counts(dropna=False)

None                                     308
PyCharm Community Edition                  7
PyCharm Professional Edition               7
IDLE                                       5
Sublime Text                               4
Atom                                       4
VS Code                                    4
Eclipse + Pydev                            3
Vim                                        3
NotePad++                                  3
Spyder                                     2
Python Tools for Visual Studio (PTVS)      1
Komodo Editor                              1
IntelliJ IDEA                              1
Name: What is the main editor you use for your current Python development?, dtype: int64

In [414]:
PYTHON_OBSERVATIONS_DF.iloc[:, 264].value_counts(dropna=False)

NaN                    10312
2-7 people              6036
8-12 people             1370
13-20 people             429
21-40 people             158
More than 40 people      126
Name: How many people are in your project team?, dtype: int64

In [415]:


PYTHON_OBSERVATIONS_DF.iloc[:, 263].value_counts(dropna=False)

Work in a team                               8879
Work on your own project(s) independently    8846
Work as an external consultant or trainer     706
Name: Most of the time, do you ...?, dtype: int64

In [420]:
# 263
obsve_32 = PYTHON_OBSERVATIONS_DF.iloc[:, 263:265]
obsve_32[obsve_32.iloc[:, 0] == 'Work in a team'].iloc[:,-1].value_counts(dropna=False)

2-7 people             6036
8-12 people            1370
NaN                     760
13-20 people            429
21-40 people            158
More than 40 people     126
Name: How many people are in your project team?, dtype: int64

In [422]:
obsve_32[pd.isna(obsve_32.iloc[:, -1])].iloc[:,0].value_counts(dropna=False)

Work on your own project(s) independently    8846
Work in a team                                760
Work as an external consultant or trainer     706
Name: Most of the time, do you ...?, dtype: int64

In [425]:
# 266
PYTHON_OBSERVATIONS_DF.iloc[:, 265].value_counts(dropna=False)

Fully employed by a company / organization                                                                 11353
Student                                                                                                     3535
Freelancer <em>(a person pursuing a profession without a long-term commitment to any one employer)</em>     1136
Self-employed <em>(a person earning income directly from one's own business, trade, or profession)</em>     1072
Partially employed by a company / organization                                                               795
Other – Write In:                                                                                            390
Retired                                                                                                      150
Name: What is your employment status?, dtype: int64

In [426]:
PYTHON_OBSERVATIONS_DF.iloc[:, 266].value_counts(dropna=False)

NaN                6302
51–500             3030
11–50              2338
More than 5,000    2330
2–10               1484
1,001–5,000        1255
501–1,000           806
Just me             609
Not sure            277
Name: How many people work for your company / organization?, dtype: int64

In [433]:
obser = PYTHON_OBSERVATIONS_DF.iloc[:, [266]]
obser

Unnamed: 0,How many people work for your company / organization?
1404,"More than 5,000"
1405,Just me
1406,51–500
1407,51–500
1408,11–50
1409,
1410,
1411,
1412,
1413,"More than 5,000"


In [434]:

for index, columns in enumerate(raw_data.columns):
    print('#{}: {}'.format(index+1,  columns))
    

#1: Unnamed: 0
#2: Is Python the main language you use for your current projects?
#3: None:What other language(s) do you use?
#4: Java:What other language(s) do you use?
#5: JavaScript:What other language(s) do you use?
#6: C/C++:What other language(s) do you use?
#7: PHP:What other language(s) do you use?
#8: C#:What other language(s) do you use?
#9: Ruby:What other language(s) do you use?
#10: Bash / Shell:What other language(s) do you use?
#11: Objective-C:What other language(s) do you use?
#12: Go:What other language(s) do you use?
#13: Visual Basic:What other language(s) do you use?
#14: Scala:What other language(s) do you use?
#15: SQL:What other language(s) do you use?
#16: Kotlin:What other language(s) do you use?
#17: R:What other language(s) do you use?
#18: Swift:What other language(s) do you use?
#19: Clojure:What other language(s) do you use?
#20: Perl:What other language(s) do you use?
#21: Rust:What other language(s) do you use?
#22: Groovy:What other language(s) do you 

In [437]:
PYTHON_OBSERVATIONS_DF.iloc[:, [268]]

Unnamed: 0,Choose one from the list::Which of the following industries do you develop for?
1404,
1405,
1406,Other
1407,
1408,
1409,
1410,
1411,
1412,
1413,


In [438]:
!pwd

/Users/ogbuejioforchidiebere/Documents/Programming/Python/Practice/Data Science/JBeansData
