# Stack Overflow Developer Survey Data (2011-2022) Analysis

This notebook presents the project's methodology and purpose.
It loads the survey data from CSV files into pandas DataFrames, ready to be analyzed.
A general description of the methodology used in the project follows.


## CRISP-DM analysis
In the following sections I approach the dataset using the CRISP-DM methodology.
CRISP-DM stands for Cross-Industry Standard Process for Data Mining and is a process structured in 6 steps:
1. Business Understanding (common)
2. Data Understanding (common)
3. Data Preparation (question specific)
4. Data Modeling (question specific, not implemented in this project)
5. Results Analysis (question specific)
6. Deployment (question specific, not implemented in this project)


## 1. Business Understanding

In this section, I will formulate some questions about the Stack Overflow survey dataset.

For this purpose, I will use the survey data from every year from 2011 to 2022 (12 yearly surveys in total).

The questions are the following:
1. Which languages were the most popular each year? This could be shown with bar charts or count plots. What trends appear in the popularity of the top 10 languages? This could be shown by plotting share percentages across the years.
2. For a specific platform such as Android, are there visible shifts in popularity between two or more of the top ten languages over the years? This could be shown with targeted bar charts. The case of Kotlin vs. Java is interesting.
3. Does the number of years spent programming influence the preferred or most used language? This could be explored with scatter plots or heatmaps, and possibly with violin or box plots, faceting, or adapted univariate plots. The average number of years in programming can go on the Y axis. This compares a qualitative variable (most used language) with a quantitative one (number of years in programming).
4. Does a developer's principal language (or languages) influence the desire to learn a specific language in the future? This could also be explored with a scatter plot, though it may be better to explore correlation with other features too.
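
As a rough illustration of the share percentages mentioned in question 1, the sketch below counts how often each language appears in a multi-answer column. The column name `LanguageWorkedWith`, the `;` separator, and the toy data are assumptions made for this example; the real column names and formats differ between survey years.

```python
import pandas as pd

# Toy stand-in for one survey year; the column name and the ";" separator
# are assumptions, not taken from the real survey files.
toy_year = pd.DataFrame(
    {"LanguageWorkedWith": ["Python;Java", "Java;Kotlin", "Python", None]}
)


def language_shares(df, column="LanguageWorkedWith", sep=";"):
    """Percentage of respondents who answered and mentioned each language."""
    answers = df[column].dropna()
    # One row per (respondent, language) pair
    exploded = answers.str.split(sep).explode().str.strip()
    counts = exploded.value_counts()
    return (counts / len(answers) * 100).round(1)


shares = language_shares(toy_year)
print(shares)
```

Running the same function on each year's dataframe and collecting the results would give the per-year shares needed for a trend plot.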


## 2. Data Understanding

In this section, I will take a look at the data and check whether it can answer the questions formulated above,
by exploring the features of the dataframes.


### Data Gathering and Access

The data first has to be loaded from CSV files.

To do this, I will use a support function that returns the data as a dictionary whose keys are the reference years.

Before proceeding, I need to import the pandas library:

In [None]:
import pandas as pd

Now I'll load data from CSV files:

In [None]:
# reading survey data from oldest to newest
from preparation.data_load import load_surveys_data_from_csv
df_surveys_11_to_22 = load_surveys_data_from_csv(years=range(2011, 2023))
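
The project's `load_surveys_data_from_csv` is not shown in this notebook. A minimal sketch of what such a loader might look like follows; the function name `load_surveys_sketch`, the `data` directory, and the `survey_{year}.csv` file naming are all hypothetical and only illustrate the year-keyed dictionary idea.

```python
from pathlib import Path

import pandas as pd


def load_surveys_sketch(years, data_dir="data", pattern="survey_{year}.csv"):
    """Hypothetical loader: return a {year: DataFrame} dictionary."""
    surveys = {}
    for year in years:
        # Directory layout and file naming are assumptions for this sketch
        csv_path = Path(data_dir) / pattern.format(year=year)
        # low_memory=False reads each file in one pass, so column dtypes
        # are inferred consistently on wide survey files
        surveys[year] = pd.read_csv(csv_path, low_memory=False)
    return surveys
```

Keying the dictionary by year keeps each survey's own column layout intact, which matters because the questionnaire changed from year to year.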

### Data Exploration
In this section I'll look at each year's dataframe in turn.

#### Preliminary operations

Before proceeding with data exploration, I need the maximum number of rows and columns across the loaded pandas DataFrames,
so that each DataFrame is displayed correctly during the data exploration phases.

First I will compute these parameters on the dataset:

In [None]:
from preparation.data_load import get_dataset_max_shapes
max_rows, max_cols = get_dataset_max_shapes(df_surveys_11_to_22)
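
The body of `get_dataset_max_shapes` is not shown here. Assuming it returns the largest row count and the largest column count found across the yearly dataframes, a minimal sketch could look like this (`max_shapes_sketch` and the toy data are hypothetical):

```python
import pandas as pd


def max_shapes_sketch(surveys):
    """Return (max rows, max columns) over a {year: DataFrame} dictionary."""
    max_rows = max(df.shape[0] for df in surveys.values())
    max_cols = max(df.shape[1] for df in surveys.values())
    return max_rows, max_cols


# Toy data: 2011 has 3 rows and 1 column, 2012 has 1 row and 2 columns
toy_surveys = {
    2011: pd.DataFrame({"a": [1, 2, 3]}),
    2012: pd.DataFrame({"a": [1], "b": [2]}),
}
print(max_shapes_sketch(toy_surveys))  # (3, 2)
```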

Then I will set them as pandas display options for the notebook:

In [None]:
# set the display limits from the parameters computed on the dataset
pd.set_option('display.max_rows', max_rows)
pd.set_option('display.max_columns', max_cols)
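
If changing the display options for the whole notebook is not desired, pandas also offers `pd.option_context`, which applies options only inside a `with` block. The sketch below uses hypothetical limits in place of the values computed from the dataset.

```python
import pandas as pd

# Hypothetical limits, standing in for the values computed from the dataset
max_rows, max_cols = 100, 50

before = pd.get_option("display.max_columns")

# The options apply only inside the with block and are restored afterwards
with pd.option_context(
    "display.max_rows", max_rows, "display.max_columns", max_cols
):
    inside = pd.get_option("display.max_columns")

after = pd.get_option("display.max_columns")
```

This keeps wide survey dataframes readable in the cells that need it without affecting the rest of the notebook.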

Now I'll go through the dataset year by year, dataframe by dataframe, looking at the features of each.

In [None]:
# 2011 survey data dataframe
df_surveys_11_to_22[2011].head(3)

In [None]:
# 2012 survey data dataframe
df_surveys_11_to_22[2012].head(3)

In [None]:
# 2013 survey data dataframe
df_surveys_11_to_22[2013].head(3)

In [None]:
# 2014 survey data dataframe
df_surveys_11_to_22[2014].head(3)

In [None]:
# 2015 survey data dataframe
df_surveys_11_to_22[2015].head(3)

In [None]:
# 2016 survey data dataframe
df_surveys_11_to_22[2016].head(3)

In [None]:
# 2017 survey data dataframe
df_surveys_11_to_22[2017].head(3)

In [None]:
# 2018 survey data dataframe
df_surveys_11_to_22[2018].head(3)

In [None]:
# 2019 survey data dataframe
df_surveys_11_to_22[2019].head(3)

In [None]:
# 2020 survey data dataframe
df_surveys_11_to_22[2020].head(3)

In [None]:
# 2021 survey data dataframe
df_surveys_11_to_22[2021].head(3)

In [None]:
# 2022 survey data dataframe
df_surveys_11_to_22[2022].head(3)
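
Alongside the per-year previews above, a compact summary of each dataframe's size can help when comparing years. The sketch below assumes the same year-keyed dictionary structure; `summarize_surveys` and the toy data are hypothetical.

```python
import pandas as pd


def summarize_surveys(surveys):
    """Build a table with the number of rows and columns for each year."""
    records = [
        {"year": year, "rows": df.shape[0], "columns": df.shape[1]}
        for year, df in sorted(surveys.items())
    ]
    return pd.DataFrame(records).set_index("year")


# Toy stand-in for the real {year: DataFrame} dictionary
toy_surveys = {
    2012: pd.DataFrame({"a": [1], "b": [2]}),
    2011: pd.DataFrame({"a": [1, 2, 3]}),
}
summary = summarize_surveys(toy_surveys)
print(summary)
```

On the real data this would be called as `summarize_surveys(df_surveys_11_to_22)`.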

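To share the loaded data between notebooks, I will use the IPython `%store` magic, which persists the variable so that another notebook can restore it with `%store -r df_surveys_11_to_22`: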

In [None]:
%store df_surveys_11_to_22