#  Data Science Overview

## The Data Science Process

1) Setting the research goal

2) Retrieving data

3) Data preparation 

4) Data exploration 

5) Data modelling

6) Presentation and automation

## Step 1: Setting the Research Goal   

Aim to get formal __agreement__ on __deliverables__ through a _project charter_, which may include:
    
- Statement of research __goal__

- Broader project mission and __context__

- Required __resources__ and __data__

- How __analysis__ will be performed 

- Proof that it's an __achievable__ project

- Measure of __success__

- Formal __deliverables__ (e.g. project report)

## Step 2: Identify and Retrieve Data

This is the first time you inspect the data in the data science process. Most of the errors here are easy to spot.

Focus on:

- Whether the data matches the data in the __source document__

- Whether you have the right __data types__

Stop when you have enough evidence that the data is similar to the data you find in the source document. 
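
The two checks above can be sketched in Python with the standard library only. The CSV content, column names and expected types are made up for illustration; in practice the source would be a file or a database query.

```python
import csv
import io

# Hypothetical source extract (illustrative values only).
SOURCE_CSV = """country,year,gdp_usd,population
USA,2020,20900000000000,331000000
FRA,2020,2630000000000,67000000
"""

# Expected type for each column (an assumption made for this example).
EXPECTED_TYPES = {"country": str, "year": int, "gdp_usd": float, "population": int}


def load_rows(text):
    """Parse CSV text and cast each column to its expected type."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({col: cast(raw[col]) for col, cast in EXPECTED_TYPES.items()})
    return rows


rows = load_rows(SOURCE_CSV)

# Check 1: the number of loaded rows matches the source (header line excluded).
source_row_count = len(SOURCE_CSV.strip().splitlines()) - 1
assert len(rows) == source_row_count

# Check 2: every value has the expected data type.
for row in rows:
    for col, expected in EXPECTED_TYPES.items():
        assert isinstance(row[col], expected), (col, row[col])
```

A cast that fails (for example `int("n/a")`) raises a `ValueError`, which is usually the quickest way to spot a wrong data type at this stage.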

## Step 3: Data Preparation 

Here, do a more elaborate check of the data. The errors here should also be present in the __source document__.

Focus is on the __content__ of the variables:

- Typos: USQ to USA

- Other data entry errors

- Missing values 

- Inconsistencies: "F" or "Female"

- Transformation (e.g. total GDP to per capita GDP, categorical data to numerical)
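
A minimal sketch of these checks in plain Python is shown below. The records, the correction tables and the field names are hypothetical; real projects often use a library such as pandas for the same steps.

```python
# Hypothetical raw records containing the kinds of errors listed above.
raw_records = [
    {"country": "USQ", "sex": "F", "gdp": 2000.0, "population": 100},
    {"country": "USA", "sex": "Female", "gdp": None, "population": 50},
    {"country": "FRA", "sex": "M", "gdp": 900.0, "population": 30},
]

COUNTRY_FIXES = {"USQ": "USA"}            # known typos -> corrected value
SEX_CODES = {"F": "Female", "M": "Male"}  # short codes -> one consistent label


def clean(record):
    """Return a cleaned copy of one record."""
    out = dict(record)
    out["country"] = COUNTRY_FIXES.get(out["country"], out["country"])
    out["sex"] = SEX_CODES.get(out["sex"], out["sex"])
    # Transformation: total GDP to per capita GDP (stays None if GDP is missing).
    if out["gdp"] is None:
        out["gdp_per_capita"] = None
    else:
        out["gdp_per_capita"] = out["gdp"] / out["population"]
    return out


cleaned = [clean(r) for r in raw_records]

# Missing values are flagged here rather than silently filled in.
missing_gdp = [r for r in cleaned if r["gdp"] is None]
```

How to treat the missing values (drop, impute, or go back to the source) is a separate decision that this sketch leaves open.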

This phase often consumes a lot of time, but it's vital.

__Garbage in, garbage out.__

## Step 4: Data Exploration 

Now, take a deep dive into the data using __graphical techniques__.

The aim is to:

- gain an __understanding__ of __each feature__ 

- gain an __understanding__ of the __interactions__ between features

- compute __descriptive statistics__: mean, median, mode and standard deviation (sd)
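
As a small illustration, the descriptive statistics can be computed with Python's built-in `statistics` module. The `ages` values are invented for the example.

```python
import statistics

# Hypothetical values of a single feature.
ages = [23, 25, 25, 31, 40, 52]

summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "sd": statistics.stdev(ages),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean is noticeably larger than the median, which hints at a right-skewed feature and is worth confirming with a histogram.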

This phase is about __exploring__ data, so keeping your mind open and your eyes peeled is important.

The goal isn't to cleanse the data, but it's common that you'll still discover anomalies you missed before, forcing you to take a step back and fix them.



## Step 5: Data Modelling 

The modelling phase consists of __four steps__:

- __Feature engineering__ and __model selection__

- __Training__ the model

- Model __validation and selection__

- Applying the trained model to __unseen data__

Before you find a good model, you'll probably __iterate__ among the first three steps. 

The __last step__ isn't always present because sometimes the goal isn't prediction but explanation (__root cause analysis__).

- For example, you might want to find out the __causes of species' extinctions__ but not necessarily predict which one is next in line to leave our planet.

### Engineering Features and Selecting a Model

When engineering features, you must come up with and create __possible predictors__ for the model.

This is one of the most important steps in the process because a model __recombines these features__ to __achieve its predictions__. 

Often you may need to __consult an expert__ or __appropriate literature__ to come up with __meaningful features__.
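
A minimal sketch of feature engineering follows, loosely based on the species example above. The observations, the habitat categories and the choice of population density as a feature are all assumptions made for illustration, not features recommended by an expert.

```python
# Hypothetical raw observations about two species.
observations = [
    {"habitat": "forest", "population": 1200, "area_km2": 300},
    {"habitat": "wetland", "population": 150, "area_km2": 50},
]

HABITATS = ["forest", "wetland", "grassland"]  # assumed set of categories


def make_features(obs):
    """Turn one raw observation into a numeric feature vector."""
    # Derived feature: population density (individuals per square kilometre).
    density = obs["population"] / obs["area_km2"]
    # Categorical to numerical: one-hot encode the habitat.
    one_hot = [1 if obs["habitat"] == h else 0 for h in HABITATS]
    return [density] + one_hot


X = [make_features(o) for o in observations]
```

Each row of `X` is now purely numeric, which is the form most modelling techniques expect.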

### Training Your Model

In this phase, you __present your model with data__ from which it can learn.

The __most common modelling techniques__ have __industry-ready implementations__ in almost every programming language, including __Python__.

These enable you to __train__ your models by __executing a few lines of code__. 

Once a model is trained, it's time to test whether it can be extrapolated to reality (__model validation__).

### Validating a Model

Data science has many modelling techniques, and the question is __which one is the right one to use__?

A good model has two properties:

- It has __good predictive power__

- It __generalizes well__ to data it hasn't seen

To achieve this you define:

- An __error measure__ (e.g. classification error rate, or precision and recall)

- A __validation strategy__ 
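
The error measures named above can be computed as follows for a binary classifier. The labels are made up, and `1` is assumed to be the positive class.

```python
def error_rate(y_true, y_pred):
    """Fraction of predictions that differ from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

err = error_rate(y_true, y_pred)
precision, recall = precision_recall(y_true, y_pred)
```

Returning `0.0` when a denominator is zero is a convention chosen for this sketch; other conventions exist.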

Many validation strategies exist, including splitting and testing the data on:

- A simple train and test split

- K-folds cross validation 

- Leave-1 out validation 

__Train and test split__ trains the model on one training set and validates the model on one test set.

![title](traintest.jpg)
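
A train and test split might look like the sketch below. The function name, the 25% test fraction and the fixed seed are choices made for this example only.

```python
import random


def train_test_split(data, test_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it into (train, test)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))  # stand-in for 20 observations
train, test = train_test_split(data)
```

Shuffling before splitting avoids a test set that only contains the last rows of an ordered file.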

__K-folds cross validation__ divides the data set into k parts and uses each part one time as a test set while using the others as a training data set. This has the advantage that you use all the data available in the data set.

![title](kfold.jpg)
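
The splitting logic of k-folds cross validation can be sketched as a generator of index lists. This is an illustrative helper, not a library function, and it does not shuffle the data first.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    indices = list(range(n))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(n=10, k=3))
```

You would train and evaluate the model once per fold and then average the k error values.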

__Leave-1 out validation__ is the same as k-folds, but with k equal to the total number of examples. You always leave one observation out and train on the rest of the data. This is usually used only on small data sets, so it's more valuable to people evaluating laboratory experiments than to big data analysts.

![title](leave1.jpg)
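
As a small sketch, leave-1 out validation is shown here with a deliberately trivial "model" that always predicts the mean of its training data. The measurements and the choice of mean squared error are assumptions for illustration.

```python
import statistics

values = [2.0, 4.0, 6.0, 8.0]  # hypothetical measurements

squared_errors = []
for i, held_out in enumerate(values):
    # Train on everything except observation i.
    training = values[:i] + values[i + 1:]
    prediction = statistics.mean(training)  # "model": predict the training mean
    squared_errors.append((held_out - prediction) ** 2)

# One error per observation, averaged into a single score.
loo_mse = statistics.mean(squared_errors)
```

The model is refitted once per observation, which is why this strategy becomes expensive on large data sets.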

## Step 6: Presenting/Reporting Findings

Once the modelling and data analysis are complete, the findings need to be presented to the stakeholders.

Appropriate visual tools can enhance the presentation of results:

- Texts

- Tables

- Graphs

Overall, it's important to tell a compelling story, typically in the form of a formal report.

### Automation

Sometimes, people want to __repeat__ your work over and over again, because they value the predictions of your models or the insights.

This __doesn't__ mean that you have to __redo__ all of your analysis all the time:

- sometimes it's sufficient that you implement only the __model scoring__

- other times you might build an application that __automatically updates__ reports or spreadsheets
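
Implementing only the model scoring could look like the sketch below: the parameters of an already trained model are stored once and reused on new data without retraining. The JSON layout and the linear model are hypothetical.

```python
import json

# Hypothetical parameters of an already trained linear model,
# stored as JSON so they can be reloaded without retraining.
saved_model = json.dumps({"slope": 2.0, "intercept": 50.0})


def score(model_json, new_values):
    """Apply the stored model to new input values."""
    params = json.loads(model_json)
    return [params["slope"] * x + params["intercept"] for x in new_values]


predictions = score(saved_model, [6, 7])
```

A scheduled job could call `score` on each new batch of data and write the results to a report.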