# Data Science Workflow

![Data Science Workflow](img/ds-workflow.png)

## Step 1: Acquire
### Explore Problem
#### Data Science: Understanding the Problem *(Lesson 00)*
- Get the right question:
    - What is the **problem** we are trying to **solve**?
    - This forms the **Data Science problem**
    - **Examples**
        - Sales figures and call center logs: evaluate a new product
        - Sensor data from multiple sensors: detect equipment failure
        - Customer data + marketing data: better targeted marketing
- **Assess situation**
    - Risks, Benefits, Contingencies, Regulations, Resources, Requirements
- **Define goal**
    - What is the **objective**?
    - What are the **success criteria**?
- **Conclusion**
    - Defining the problem is key to successful Data Science projects

### Identify Data
#### Great Places to Find Data *(Lesson 05)*
- [UC Irvine Machine Learning Repository!](https://archive.ics.uci.edu/ml/index.php)
- [KD Nuggets](https://www.kdnuggets.com/datasets/index.html) Datasets for Data Mining, Data Science, and Machine Learning
    - [KD Nuggets](https://www.kdnuggets.com/datasets/government-local-public.html) Government, State, City, Local and Public
    - [KD Nuggets](https://www.kdnuggets.com/datasets/api-hub-marketplace-platform.html) APIs, Hubs, Marketplaces, and Platforms
    - [KD Nuggets](https://www.kdnuggets.com/competitions/index.html) Analytics, Data Science, Data Mining Competitions
- [data.gov](https://www.data.gov) The home of the U.S. Government’s open data
- [data.gov.uk](https://data.gov.uk) Data published by central government
- [World Health Organization](https://www.who.int/data/gho) Explore a world of health data
- [World Bank](https://data.worldbank.org) source of world data
- [Kaggle](https://www.kaggle.com) is an online community of data scientists and machine learning practitioners.

### Import Data
####  Read CSV files *(Lesson 05)*
- Comma-Separated Values ([Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values))
- Learn more about CSV processing [in this YouTube lesson on CSV](https://youtu.be/LEyojSOg4EI)
- [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html): read a comma-separated values (csv) file into **pandas** DataFrame.
```Python
import pandas as pd
data = pd.read_csv('files/aapl.csv', parse_dates=True, index_col=0)
```
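
To see what `index_col=0` and `parse_dates=True` do without needing `files/aapl.csv`, here is a minimal sketch that reads a small in-memory CSV. The column names and values are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file such as files/aapl.csv
csv_text = """Date,Open,Close
2021-01-04,133.52,129.41
2021-01-05,128.89,131.01
2021-01-06,127.72,126.60
"""

# index_col=0 makes the first column the index;
# parse_dates=True asks pandas to parse that index as dates
data = pd.read_csv(io.StringIO(csv_text), parse_dates=True, index_col=0)

print(type(data.index))
print(data.loc[pd.Timestamp('2021-01-05'), 'Close'])
```

With a date index in place, rows can be selected by date rather than by position.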

#### Excel files *(Lesson 05)*
- Most widely used [spreadsheet](https://en.wikipedia.org/wiki/Spreadsheet)
- Learn more about Excel processing [in this lecture](https://www.learnpythonwithrune.org/csv-groupby-processing-to-excel-with-charts-using-pandas-python/)
- [`read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) Read an Excel file into a pandas DataFrame.
```Python
data = pd.read_excel('files/aapl.xlsx', index_col='Date')
```

#### Parquet files *(Lesson 05)*
- [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a free open source format
- Compressed format
- [`read_parquet()`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html) Load a parquet object from the file path, returning a DataFrame.
```Python
data = pd.read_parquet('files/aapl.parquet')
```

#### Web Scraping *(Lesson 03)*
- Extracting data from websites
- Legal issues: [wikipedia.org](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)
- [`read_html()`](https://pandas.pydata.org/docs/reference/api/pandas.read_html.html) Read HTML tables into a list of DataFrame objects.
```Python
url = "https://en.wikipedia.org/wiki/Wikipedia:Fundraising_statistics"
data = pd.read_html(url)
```

#### Databases *(Lesson 04)*
- [`read_sql()`](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html) Read SQL query or database table into a DataFrame.
- The [sqlite3](https://docs.python.org/3/library/sqlite3.html) module is an interface for SQLite databases.
```Python
import sqlite3
import pandas as pd
conn = sqlite3.connect('files/dallas-ois.sqlite')
data = pd.read_sql('SELECT * FROM officers', conn)
```
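
The same call can be tried without `files/dallas-ois.sqlite` by using an in-memory SQLite database. This is a sketch only: the table contents below are invented, and the `WHERE` clause with a parameter is an addition to show that filtering can be done in the query.

```python
import sqlite3
import pandas as pd

# In-memory database so the example runs without a database file
conn = sqlite3.connect(':memory:')

# Hypothetical table and rows, for illustration only
officers = pd.DataFrame({'case_number': ['A1', 'A2', 'B1'],
                         'gender': ['M', 'F', 'M']})
officers.to_sql('officers', conn, index=False)

# Let the database do the filtering; the value is passed as a query parameter
data = pd.read_sql('SELECT * FROM officers WHERE gender = ?', conn, params=('M',))
conn.close()

print(data)
```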

### Combine Data *(Lesson 06)*
- Often we need to combine data from different sources

#### pandas DataFrames
- pandas DataFrames can combine data ([pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf))
- `concat([df1, df2], axis=0)`: [concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) Concatenate pandas objects along a particular axis 
- `df.join(other.set_index('key'), on='key')`: [join](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html) Join columns of another DataFrame.
- `df1.merge(df2, how='inner', on='a')` [merge](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) Merge DataFrame or named Series objects with a database-style join

## Step 2: Prepare
### Explore Data
#### Simple Exploration
- [`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) Return the first n rows.
- [`.shape`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) Return a tuple representing the dimensionality of the DataFrame.
- [`.dtypes`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html) Return the dtypes in the DataFrame.
- [`info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) Print a concise summary of a DataFrame.
- [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) Generate descriptive statistics.
- [`isna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.isna.html).[`any()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html) Return whether any element is missing, per column.
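
A first pass over a new dataset might look like the sketch below. The DataFrame is made up for illustration, and `isna().sum()` is an addition that counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value
data = pd.DataFrame({'Age': [25, 32, np.nan, 41],
                     'Gender': ['F', 'M', 'M', 'F']})

print(data.head(2))        # first 2 rows
print(data.shape)          # (rows, columns)
print(data.dtypes)         # dtype per column
data.info()                # prints a summary, including non-null counts
print(data.describe())     # statistics for the numeric columns
print(data.isna().any())   # per column: True if any value is missing
print(data.isna().sum())   # per column: number of missing values
```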

#### Groupby, Counts and Statistics *(Lesson 08)*
- Count groups to see the significance across results
```Python
data.groupby('Gender').count()
```
- Compute the mean of each group.
```Python
data.groupby('Gender').mean()
```
- Standard Deviation
    - **Standard deviation** is a measure of how dispersed (spread) the data is in relation to the mean.
    - Low **standard deviation** means data is close to the mean.
    - High **standard deviation** means data is spread out.
![Standard deviation](img/std-diagram.png)
```Python
data.groupby('Gender').std()
```
- Box plots
    - Box plots are a great way to visualize descriptive statistics
    - Notice that Q1: 25%, Q2: 50%, Q3: 75%

![Box plots](img/box-plot.png)

- Make a box plot of the DataFrame columns [plot.box()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.box.html)

```Python
data.boxplot()
```
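
The quartiles that a box plot draws can also be computed directly. The sketch below uses made-up values; the 1.5 × IQR rule for flagging outliers is a common convention (and the default whisker length in matplotlib box plots), not something prescribed by the lesson.

```python
import pandas as pd

# Hypothetical sample, including one unusually large value
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

# Q1, Q2 (the median) and Q3 are the 25%, 50% and 75% quantiles
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range: the height of the box

# Flag points more than 1.5 * IQR outside the box as outliers
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(q1, q2, q3)
print(outliers.tolist())
```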

### Visualize Data *(Lesson 01)*
#### Simple Plot
```Python
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot()
```
- Adding title and labels
    - ```title='Title'``` adds the title
    - ```xlabel='X label'``` adds or changes the X-label
    - ```ylabel='Y label'``` adds or changes the Y-label
```Python
data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)')
```
- Adding ranges
    - ```xlim=(min, max)``` or ```xlim=min``` Sets the x-axis range
    - ```ylim=(min, max)``` or ```ylim=min``` Sets the y-axis range
```Python
data['USA'].plot(title='US CO2 per capita', ylabel='CO2 (metric tons per capita)', ylim=0)
```
- Comparing data
```Python
data[['USA', 'WLD']].plot(ylim=0)
```

#### Scatter Plot
- Good for spotting a relationship between two variables
```Python
data = pd.read_csv('files/sample_corr.csv')
data.plot.scatter(x='x', y='y')
```

#### Histogram
- Identifying data quality
```Python
data = pd.read_csv('files/sample_height.csv')
data.plot.hist()
```
- Identifying outliers
```Python
data = pd.read_csv('files/sample_age.csv')
data.plot.hist()
```
- Setting bins and figsize
```Python
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot.hist(figsize=(20,6), bins=10)
```

#### Bar Plot
- Normal plot
```Python
data = pd.read_csv('files/WorldBank-ATM.CO2E.PC_DS2.csv', index_col=0)
data['USA'].plot.bar()
```
- Range and columns, figsize and label
```Python
data[['USA', 'DNK']].loc[2000:].plot.bar(figsize=(20,6), ylabel='CO2 emission per capita')
```

#### Pie Chart
- Presenting
```Python
df = pd.Series(data=[3, 5, 7], index=['Data1', 'Data2', 'Data3'])
df.plot.pie()
```
- Value counts in Pie Charts
    - ```colors=<list of colors>```
    - ```labels=<list of labels>```
    - ```title='<title>'```
    - ```ylabel='<label>'```
    - ```autopct='%1.1f%%'``` sets percentages on chart
```Python
counts = (data['USA'] < 17.5).value_counts().reindex([False, True], fill_value=0)  # fixed order so the labels match
counts.plot.pie(colors=['r', 'g'], labels=['>= 17.5', '< 17.5'], title='CO2', autopct='%1.1f%%')
```

### Cleaning Data *(Lesson 09)*
- [`dropna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) Remove missing values.
- [`fillna()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) Fill NA/NaN values using the specified method.
    - Example: Fill missing values with mean.
```Python
data = data.fillna(data.mean())
```
- [`drop_duplicates()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) Return DataFrame with duplicate rows removed.
- Working with time series
    - [`reindex()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reindex.html) Conform Series/DataFrame to new index with optional filling logic.
    - [`interpolate()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html) Fill NaN values using an interpolation method.
- Resources
    - pandas user guide: [Working with missing data](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html)
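
The cleaning calls above can be tried on a small example. This is a sketch with invented data; which strategy fits (dropping, filling, interpolating) depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value and a duplicated row
data = pd.DataFrame({'a': [1.0, np.nan, 3.0, 3.0],
                     'b': [10.0, 20.0, 30.0, 30.0]})

dropped = data.dropna()             # rows with any missing value are removed
filled = data.fillna(data.mean())   # missing values replaced by the column mean
unique = data.drop_duplicates()     # the repeated last row is removed

# Time series: reindex to a full daily range, then interpolate the gaps
ts = pd.Series([1.0, 4.0], index=pd.to_datetime(['2021-01-01', '2021-01-04']))
full = ts.reindex(pd.date_range('2021-01-01', '2021-01-04', freq='D'))
interpolated = full.interpolate()   # linear interpolation by default

print(filled)
print(interpolated)
```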


## Step 3: Analyze
### Split into Train and Test *(Lesson 10)*
- Assign independent features (those predicting) to `X`
- Assign classes (labels/dependent features) to `y`
- Divide into training and test sets
```Python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
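
For a complete picture, here is a sketch that builds `X` and `y` from a DataFrame before splitting. The data and column names are hypothetical, and `stratify=y` is an optional addition that keeps the class balance similar in both sets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: two features and a class label
data = pd.DataFrame({'height': [150, 160, 170, 180, 190, 155, 165, 175, 185, 195],
                     'weight': [50, 55, 65, 80, 90, 52, 60, 72, 85, 95],
                     'label':  [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]})

X = data[['height', 'weight']]   # independent features
y = data['label']                # classes to predict

# test_size=0.2 holds out 20% of the rows; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
```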

### Feature Scaling *(Lesson 11)*
- **Feature Scaling** transforms values into a similar range so that machine learning algorithms behave optimally.
- Without **Feature Scaling**, features spanning different magnitudes can be a problem for **Machine Learning** algorithms.
- **Feature Scaling** can also make it easier to compare results
#### Feature Scaling Techniques
- **Normalization** is a special case of **MinMaxScaler**
    - **Normalization**: Converts values between 0-1
```Python
(values - values.min())/(values.max() - values.min())
```
    - **MinMaxScaler**: Between any values
- **Standardization** (**StandardScaler** from sklearn)
    - Mean: 0, StdDev: 1
```Python
(values - values.mean())/values.std()
```
    - Less sensitive to outliers
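
Both formulas can be checked by hand on a single feature. A minimal sketch, with made-up values:

```python
import pandas as pd

# Hypothetical values for a single feature
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization: rescale to the range 0-1
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization: shift to mean 0 and scale to standard deviation 1
standardized = (values - values.mean()) / values.std()

print(normalized.tolist())
print(standardized.mean(), standardized.std())
```

Note that pandas `std()` computes the sample standard deviation (`ddof=1`), while sklearn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two give slightly different numbers on small datasets.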

#### Normalization
- [`MinMaxScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) Transform features by scaling each feature to a given range.
- `MinMaxScaler().fit(X_train)` is used to create a scaler.
    - Notice: We only do it on training data
```Python
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train_norm = norm.transform(X_train)
X_test_norm = norm.transform(X_test)
```

#### Standardization
- [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) Standardize features by removing the mean and scaling to unit variance.
```Python
from sklearn.preprocessing import StandardScaler
scale = StandardScaler().fit(X_train)
X_train_stand = scale.transform(X_train)
X_test_stand = scale.transform(X_test)
```

### Feature Selection *(Lesson 12)*
- **Feature selection** is about selecting attributes that have the greatest impact towards the **problem** you are solving.

#### Why Feature Selection?
- Higher accuracy
- Simpler models
- Reducing overfitting risk

#### Feature Selection Techniques

##### Filter methods
- Independent of Model
- Based on scores of statistical tests
- Easy to understand
- Good for early feature removal
- Low computational requirements

##### Examples
- [Chi square](https://en.wikipedia.org/wiki/Chi-squared_test)
- [Information gain](https://en.wikipedia.org/wiki/Information_gain_in_decision_trees)
- [Correlation score](https://en.wikipedia.org/wiki/Correlation_coefficient)
- [Correlation Matrix with Heatmap](https://vitalflux.com/correlation-heatmap-with-seaborn-pandas/)

##### Wrapper methods
- Compare different subsets of features and run the model on them
- Basically a search problem

##### Examples
- [Best-first search](https://en.wikipedia.org/wiki/Best-first_search)
- [Random hill-climbing algorithm](https://en.wikipedia.org/wiki/Hill_climbing)
- [Forward selection](https://en.wikipedia.org/wiki/Stepwise_regression)
- [Backward elimination](https://en.wikipedia.org/wiki/Stepwise_regression)

See more on [wikipedia](https://en.wikipedia.org/wiki/Feature_selection#Subset_selection)

##### Embedded methods
- Find features that contribute most to the accuracy of the model while it is created
- Regularization is the most common method - it penalizes higher complexity

##### Examples
- [LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics))
- [Elastic Net](https://en.wikipedia.org/wiki/Elastic_net_regularization)
- [Ridge Regression](https://en.wikipedia.org/wiki/Ridge_regression)

#### Remove constant and quasi constant features
- [`VarianceThreshold`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html) Feature selector that removes all low-variance features.
```Python
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold()
sel.fit_transform(data)
```
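
With the default threshold of 0, only constant features are removed. The sketch below also removes a quasi-constant feature by raising the threshold; the data and the value `0.1` are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical data: 'const' never changes, 'quasi' changes in one row out of ten
data = pd.DataFrame({'const': [1] * 10,
                     'quasi': [0] * 9 + [1],
                     'useful': [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]})

# Features whose variance does not exceed the threshold are removed
sel = VarianceThreshold(threshold=0.1)
sel.fit(data)

# get_support() is a boolean mask over the columns that were kept
kept = data.columns[sel.get_support()]
print(list(kept))
```

Using `get_support()` keeps the column names, which `fit_transform()` drops because it returns a plain array.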
#### Remove correlated features
- The goal is to find and remove correlated features
- Calculate the correlation matrix (assign it to `corr_matrix`)
- A feature is correlated with a previous feature if the following is true
    - Notice that we use correlation 0.8
```Python
corr_features = [feature for feature in corr_matrix.columns if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]
```
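
Put together, the steps could look like the sketch below. The data is synthetic, and taking `abs()` of the correlation matrix is an addition so that strong negative correlations are caught as well.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is almost a copy of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
data = pd.DataFrame({'a': a,
                     'b': a + rng.normal(scale=0.1, size=200),
                     'c': rng.normal(size=200)})

# Absolute pairwise correlations between all columns
corr_matrix = data.corr().abs()

# For each feature, look only at the features that come before it
corr_features = [feature for feature in corr_matrix.columns
                 if (corr_matrix[feature].iloc[:corr_matrix.columns.get_loc(feature)] > 0.8).any()]

# Drop the features that duplicate information from an earlier one
reduced = data.drop(columns=corr_features)
print(corr_features)
print(list(reduced.columns))
```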

### Model Selection *(Lesson 13)*
- The process of selecting the model among a collection of candidate machine learning models

#### Problem type
- What kind of problem are you looking into?
    - **Classification**: *Predict labels on data with predefined classes*
        - Supervised Machine Learning
    - **Clustering**: *Identify similarities between objects and group them in clusters*
        - Unsupervised Machine Learning
    - **Regression**: *Predict continuous values*
        - Supervised Machine Learning
- Resource: [Sklearn cheat sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

#### Model Selection Techniques
- **Probabilistic Measures**: Scoring by performance and complexity of model.
- **Resampling Methods**: Splitting in sub-train and sub-test datasets and scoring by mean values of repeated runs.
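
As one possible illustration of a resampling method, the sketch below compares two candidate models with 5-fold cross-validation. The built-in iris dataset and the hyperparameters are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# A small built-in dataset, used here only for illustration
X, y = load_iris(return_X_y=True)

# Candidate models to compare (the hyperparameters are arbitrary choices)
candidates = {'knn': KNeighborsClassifier(),
              'linear_svc': LinearSVC(max_iter=10000)}

# cv=5 splits the data into 5 folds; each fold is used once as the sub-test set,
# and the model is scored by the mean accuracy over the 5 runs
mean_scores = {name: cross_val_score(model, X, y, cv=5).mean()
               for name, model in candidates.items()}

print(mean_scores)
```

In a real project the cross-validation would typically run on the training set only, with the test set held back for the final evaluation.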

#### A few models
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) Ordinary least squares Linear Regression *(Lesson 08)*.
```Python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
lin = LinearRegression()
lin.fit(X_train, y_train)
y_pred = lin.predict(X_test)
r2_score(y_test, y_pred)
```
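- [`SVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) C-Support Vector Classification *(Lesson 10)*. The example below uses the linear variant, [`LinearSVC`](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).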
```Python
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import accuracy_score
svc = LinearSVC()
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
accuracy_score(y_test, y_pred)
```
- [`KNeighborsClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) Classifier implementing the k-nearest neighbors vote *(Lesson 10)*.
```Python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
neigh = KNeighborsClassifier()
neigh.fit(X_train.fillna(-1), y_train)
y_pred = neigh.predict(X_test.fillna(-1))
accuracy_score(y_test, y_pred)
```

### Analyze Result
This is the main **check-point** of your analysis.
- Review the **Problem** and **Data Science problem** you started with.
    - The analysis should add value to the **Data Science Problem**
    - Sometimes our focus drifts - we need to ensure alignment with original **Problem**.
    - Go back to the **Exploration** of the **Problem** - does the result add value to the **Data Science Problem** and the initial **Problem** (which formed the **Data Science Problem**)
    - *Example:* As Data Scientists we often find the research itself valuable, but a business is often interested in increasing revenue, customer satisfaction, brand value, or similar business metrics.
- Did we learn anything?
    - Do the **Data-Driven Insights** add value?
    - *Example:* Does it add value to have evidence for: Wealthy people buy more expensive cars.
        - Confirming this hypothesis might be of value to you, but does it add any value for a car manufacturer?
- Can we make any valuable insights from our analysis?
    - Do we need more/better/different data?
    - Can we give any Actionable Data Driven Insights?
    - It is always easy to want better and more accurate high quality data.
- Do we have the right features?
    - Do we need to eliminate features?
    - Is the data cleaning appropriate?
    - Is data quality as expected?
- Do we need to try different models?
    - Data Analysis is an iterative process
    - Simpler models are more powerful
- Can the result be inconclusive?
    - Can we still give recommendations?

#### Quote
> *“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”* 
> - Sherlock Holmes
 
#### Iterative Research Process
- **Observation/Question**: Starting point (could be iterative)
- **Hypothesis/Claim/Assumption**: Something we believe could be true
- **Test/Data collection**: We need to gather relevant data
- **Analyze/Evidence**: Based on data collection did we get evidence?
    - Can our model predict? (a model is only useful once it can predict)
- **Conclude**: *Warning!* E.g.: We can conclude a correlation (this does not mean A causes B)
    - Example: Based on the collected data we can see a correlation between A and B

## Step 4: Report
### Present Findings
- You need to *sell* or *tell* a story with the findings.
- Who is your **audience**?
    - Focus on technical level and interest of your audience
    - Speak their language
    - Story should make sense to audience
    - Examples
        - **Team manager**: Might be technical, but often busy and only interested in high-level status and key findings.
        - **Data engineer/science team**: Technical exploration and similar interest as you
        - **Business stakeholders**: This might be end-customers or collaborators in other business units.
- When presenting
    - **Goal**: Communicate actionable insights to key stakeholders
    - Outline (inspiration):
        - **TL;DR** (too long; didn’t read) - clear and concise summary of the content (often one line) that frames key insights in the context of impact on key business metrics.
        - Start with your understanding of the business problem
        - How does it transform into a Data Science Problem
        - How to measure impact - what business metrics are indicators of results
        - What data is available and used
        - Presenting the hypothesis of the research
        - A visual presentation of the insights (model/analysis/key findings)
            - This is where you present the evidence for the insights
        - How to use insight and create actions
        - Followup and continuous learning increasing value

### Visualize Results
- Telling a story with the data
- This is where you convince your audience that the findings/insights are correct
- The right visualization is important
    - Example: A correlation matrix might give a Data Engineer insight into how findings were discovered, but confuse business partners.

#### Resources for visualization
- [Seaborn](https://seaborn.pydata.org) Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- [Plotly](https://plotly.com) open-source for analytic apps in Python
- [Folium](http://python-visualization.github.io/folium/) makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map.

### Credibility Counts
- This is the checkpoint for whether your research is valid
    - Are you hiding findings you did not like (not supporting your hypothesis)?
    - Remember it is the long-term relationship that counts
- Don't leave out results
    - We learn from data and find hidden patterns, to make data-driven decisions, with a long-term perspective

## Step 5: Actions
### Use Insights
- How do we follow up on the presented **Insights**?
- **No one-size-fits-all**: It depends on the **Insights** and **Problem**
- *Examples:*
    1. **Problem**: What customers are most likely to cancel subscription?
        - Say, we have insufficient knowledge of customers, and need to get more, hence we have given recommendations to gather more insights
        - But you should still try to add value
    2. **Problem**: Here is our data - find valuable insights!
        - This is a challenge as there is no given focus
        - An iterative process involving the customer can leave you with no surprises

### Measure Impact
- If the customer cannot measure the impact of your work, they do not know what they are paying for.
    - If you cannot measure it, you cannot know whether your hypotheses are correct.
    - A model is only valuable once it can be used to predict with some certainty
- There should be identified metrics/indicators to evaluate in the report
- This can evolve - we learn along the way - or we could be wrong.
- How long before we expect to see impact on identified business metrics?
- What if we do not see expected impact?
- Understanding of metrics
    - The metrics we measure are indicators that our hypothesis is correct
    - Other aspects can have impact on the result - but you need to identify that
    
### Main Goal
- Your success as a Data Scientist is to create valuable actionable insights

#### A great way to think
- Any business/organisation can be thought of as a complex system
    - Nobody understands it perfectly and it evolves organically
- Data describes some aspect of it
- It can be thought of as a black-box
- Any insights you can bring is like a window that sheds light on what happens inside

## General Advice
- **Expectations**
    - When I started my PhD (researcher) journey I expected to solve big problems - change the world for the better
    - Reality was different - small incremental contributions
    - Start with simple interesting problems - do not expect to find insights that will change the world from day one.
- **Learning**
    - This is a new field - but like any research field, it evolves and we learn new techniques and get new tools
    - This course gave you a solid basis, but there is a lot more to learn
    - Don't expect your learning to end
- **Long-term focus**
    - Be clear on your goal: Become a Data Scientist
    - This will help you when things seem difficult - everyone has times of struggle
    - Don't get discouraged by seeing someone else present some awesome work - learn from it
- **Curiosity**
    - I always say **keep it playful**
    - You need to enjoy what you do
    - Most people are curious - so let your curiosity guide you on your Data Science journey