# Project - Data Science Salaries

![Data Science Workflow](img/ds-workflow.png)

## Goal of project
- The goal of this project is to present insightful statistics on Data Science salaries
- A local newspaper (or online site) wants to write an article on how lucrative it is to be a Data Scientist

## Step 1: Acquire
- Explore problem
- Identify data
- Import data

### Step 1.a: Import libraries
- Execute the cell below (SHIFT + ENTER)
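
The import cell itself is not shown here. A minimal sketch of what it might contain, assuming the project only needs pandas for the data and matplotlib for the plots:

```python
# Assumed imports for this project (the original cell is not shown)
import pandas as pd               # DataFrames, read_csv, groupby
import matplotlib.pyplot as plt   # titles and labels for the plots in Step 4
```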

### Step 1.b: Read the data
- Use ```pd.read_csv()``` to read the file `files/data_science_salaries.csv`
- NOTE: Remember to assign the result to a variable (e.g., ```data```)
- Apply ```.head()``` to the data to check that everything looks as expected
- Dataset is from **Kaggle** (Get updated dataset [here](https://www.kaggle.com/saurabhshahane/data-science-jobs-salaries))
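
The reading step can be sketched as follows. To keep the example runnable without the file, it reads a few hypothetical rows from an in-memory string; the values are made up for illustration, and in the notebook you would pass the real path instead.

```python
import io
import pandas as pd

# Hypothetical stand-in rows. In the notebook, read the real file instead:
#   data = pd.read_csv('files/data_science_salaries.csv')
sample_csv = io.StringIO(
    "work_year,experience_level,salary,salary_currency,salary_in_usd,company_size\n"
    "2020,EN,50000,EUR,57000,S\n"
    "2021,MI,90000,USD,90000,M\n"
    "2021,SE,150000,USD,150000,L\n"
)

data = pd.read_csv(sample_csv)   # returns a DataFrame
print(data.head())               # first rows (at most 5) as a quick sanity check
```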

### Step 1.c: Inspect the data
- Check the size of the dataset
- Can you draw conclusions based on it?
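
One possible way to check the size is `.shape`, optionally combined with the number of rows per category. The DataFrame below is a hypothetical stand-in so the sketch runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the salary data
data = pd.DataFrame({
    'experience_level': ['EN', 'MI', 'SE', 'SE', 'EX'],
    'company_size': ['S', 'M', 'L', 'L', 'M'],
    'salary_in_usd': [57000, 90000, 150000, 140000, 230000],
})

n_rows, n_cols = data.shape   # (number of rows, number of columns)
print(f'{n_rows} rows, {n_cols} columns')

# Rows per category: averages over very small groups are unreliable
print(data['experience_level'].value_counts())
```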

## Step 2: Prepare
- Explore data
- Visualize ideas
- Clean data

### Step 2.a: Check the data types
- This step tells you whether any numeric column is stored as a non-numeric type.
- Get the data types with ```.dtypes```
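
A small sketch of the check, using a hypothetical frame in which a year column was read as text. Converting with `pd.to_numeric` is one option for repairing such a column, not necessarily what this dataset needs.

```python
import pandas as pd

# Hypothetical example: work_year holds numbers stored as text
data = pd.DataFrame({
    'work_year': ['2020', '2021', '2021'],
    'salary_in_usd': [57000, 90000, 150000],
})
print(data.dtypes)   # work_year shows up as a non-numeric dtype

# errors='coerce' turns values that cannot be parsed into NaN
data['work_year'] = pd.to_numeric(data['work_year'], errors='coerce')
print(data.dtypes)   # work_year is now numeric
```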

### Step 2.b: Check for null (missing) values
- Data is often missing entries; there can be many reasons for this
- We need to deal with that (covered later in the course)
- Use ```.isnull().any()```
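
As an illustration, the sketch below runs the check on a hypothetical frame with two missing entries. `.isnull().sum()` is an optional extra that counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing entries
data = pd.DataFrame({
    'experience_level': ['EN', 'MI', None, 'SE'],
    'salary_in_usd': [57000, np.nan, 150000, 140000],
    'company_size': ['S', 'M', 'L', 'L'],
})

has_missing = data.isnull().any()     # True/False per column
missing_count = data.isnull().sum()   # number of missing entries per column
print(has_missing)
print(missing_count)
```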

### Step 2.c: Understand features
- Most features are categorical
- One way to list the categories of a feature is `data['work_year'].unique()`
- The same works for the other features. Example:
    - `experience_level`: 
        - EN: Entry-level / Junior
        - MI: Mid-level / Intermediate
        - SE: Senior-level / Expert
        - EX: Executive-level / Director
- See full description on [Kaggle](https://www.kaggle.com/saurabhshahane/data-science-jobs-salaries)
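
A sketch of how the categories could be listed and made readable. The data is a hypothetical stand-in, and the `experience_label` column is an illustrative name, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the experience_level column
data = pd.DataFrame({'experience_level': ['EN', 'MI', 'SE', 'SE', 'EX', 'MI']})

print(data['experience_level'].unique())         # distinct codes, in order of appearance
print(data['experience_level'].value_counts())   # number of rows per code

# Map the codes to the descriptions listed above
labels = {
    'EN': 'Entry-level / Junior',
    'MI': 'Mid-level / Intermediate',
    'SE': 'Senior-level / Expert',
    'EX': 'Executive-level / Director',
}
data['experience_label'] = data['experience_level'].map(labels)   # hypothetical column
print(data.head())
```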

### Step 2.d: Salaries
- Notice that salaries are given in different currencies
- Also, notice `salary_in_usd`

## Step 3: Analyze
- Feature selection
- Model selection
- Analyze data

### Step 3.a: Explore features
- One way to explore features is as follows (here we explore `experience_level`)
```Python
data.groupby('experience_level')['salary_in_usd'].describe()
```
- Explore other features

### Step 3.b: Explore data on two columns
- Say you want to investigate two columns: `experience_level` and `company_size`
```Python
data.groupby(['experience_level', 'company_size'])['salary_in_usd'].mean()
```
- Try similar for other combinations

### Step 3.c: Describe data on two columns
- What does the spread look like?
- Can we conclude anything based on the data?
```Python
data.groupby(['company_size', 'experience_level'])['salary_in_usd'].describe()
```

### Step 3.d: Visualize the description
- What does this tell you?
```Python
data.boxplot(column='salary_in_usd', by='company_size')
```
- Do this for other features of your interest

## Step 4: Report
- Present findings
- Visualize results
- Credibility counts

### Step 4.a: Present your findings
- Here we focus on `company_size` and `experience_level`
- Create a DataFrame for the data to plot.
    - This makes it easy to re-order the index and columns
- Note: Do this for the features you want to present
```Python
data.groupby(['company_size', 'experience_level'])['salary_in_usd'].mean().unstack()
```
- [unstack()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.unstack.html) moves the inner level of the MultiIndex (here `experience_level`) into the columns

### Step 4.b: Re-order index and columns
- We do this to present data in a logical way
- Use `reindex(index=['S', 'M', 'L'])` (assuming the same example)
- Re-order the columns by selecting them in the desired order with `['EN', 'MI', 'SE', 'EX']`
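
Put together, the re-ordering could look like the sketch below. The data is a hypothetical stand-in, and `table` is an illustrative variable name.

```python
import pandas as pd

# Hypothetical stand-in for the salary data
data = pd.DataFrame({
    'company_size': ['L', 'L', 'M', 'M', 'S', 'S', 'L', 'M'],
    'experience_level': ['SE', 'EN', 'MI', 'EX', 'EN', 'MI', 'EX', 'SE'],
    'salary_in_usd': [150000, 70000, 90000, 230000, 57000, 80000, 250000, 140000],
})

# Mean salary per (company_size, experience_level), pivoted into a table
table = data.groupby(['company_size', 'experience_level'])['salary_in_usd'].mean().unstack()

table = table.reindex(index=['S', 'M', 'L'])   # rows: small, medium, large
table = table[['EN', 'MI', 'SE', 'EX']]        # columns: junior to executive
print(table)
```

Combinations that do not occur in the data show up as `NaN` in the table.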

### Step 4.c: Visualize results
- Visualize your result with a bar-plot
    - HINT: `plot.bar()`
- Finalize with title and labels
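
A minimal sketch of such a plot, assuming a table of mean salaries like the one built in Step 4.b. The numbers, title, and label texts are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical table of mean salaries (rows: company size, columns: experience level)
table = pd.DataFrame(
    {'EN': [57000, 65000, 70000], 'MI': [80000, 90000, 100000],
     'SE': [110000, 140000, 150000], 'EX': [180000, 230000, 250000]},
    index=pd.Index(['S', 'M', 'L'], name='company_size'),
)

ax = table.plot.bar(rot=0)   # one group of bars per company size
ax.set_title('Mean salary by company size and experience level')
ax.set_xlabel('Company size')
ax.set_ylabel('Mean salary (USD)')
ax.legend(title='Experience level')
plt.tight_layout()
```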

### Step 4.d: Credibility considerations
- With the insights we have from our analysis - could we tell another story?
- Examples:
    - Spread of salary
    - Outliers
    - Size of dataset and categories used
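
One way to probe these points is to put group size, mean, median, and standard deviation side by side. The data below is hypothetical and deliberately contains an outlier and a group with a single row.

```python
import pandas as pd

# Hypothetical data: SE contains an outlier, EX has only one row
data = pd.DataFrame({
    'experience_level': ['EN', 'EN', 'EN', 'EX', 'SE', 'SE', 'SE'],
    'salary_in_usd': [50000, 60000, 70000, 600000, 120000, 130000, 410000],
})

summary = data.groupby('experience_level')['salary_in_usd'].agg(
    ['count', 'mean', 'median', 'std']
)
print(summary)
```

A mean far above the median hints at outliers, and a group with very few rows says little about that category as a whole.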

## Step 5: Actions
- Use insights
- Measure impact
- Main goal

### Step 5.a: Use insights and measure impact
- How could we use the insights?
- How could we measure their impact?