# Data Science for Manufacturing - Workshop 4-1: More tools for EDA

## Objectives

Exploratory Data Analysis (EDA) with two examples
- Investigate and understand data (week 3 notebook)
- Cleanse and organise data (week 3 notebook)
- Univariate analysis
- Bivariate analysis
- Multivariate analysis


Resources for Python visualisations:  
Python has many ready-made visualisations spread across different packages. A list of the most important chart types, with pointers to their code and examples, is available here: https://python-graph-gallery.com

An overview of the main Python visualisation packages: https://blog.modeanalytics.com/python-data-visualization-libraries



## 1. Screws dataset

First, let's import the cleaned data from the previous week.



What we know about this dataset (same as last week):
- Sizes should be in mm
- All values apart from IDs and type are floats or integers
- All IDs start with B followed by a number

### 1.1 Basic statistics of the dataset

? Can you print a concise summary of the dataframe?

**Tip**: Look for a method that prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

? Can you calculate the min, max, and mean of each column with a numeric (quantitative) value?
- Method 1: use pandas methods in the format of `column.mean()`
- Method 2: use numpy functions in the format of `np.mean()`
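As a minimal sketch of the two methods, the block below uses a small made-up dataframe. The `length` column, the 'Wood screw' type and all values are assumptions for illustration, not the workshop data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real screws dataframe comes from the week 3 notebook.
df = pd.DataFrame({
    "id": ["B1", "B2", "B3", "B4"],
    "type": ["Metric screw", "Metric screw", "Wood screw", "Wood screw"],
    "diameter": [18.0, 12.0, 18.0, 6.0],
    "length": [40, 30, 50, 20],
})

# Method 1: pandas methods on a single column
print(df["diameter"].min(), df["diameter"].max(), df["diameter"].mean())

# Method 2: numpy functions applied to the same column
print(np.min(df["diameter"]), np.max(df["diameter"]), np.mean(df["diameter"]))

# All numeric columns at once: keep only numeric dtypes, then aggregate
numeric = df.select_dtypes(include="number")
summary = numeric.agg(["min", "max", "mean"])
print(summary)
```

Both methods give the same numbers; the pandas version is usually more convenient when the data is already in a dataframe.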

? Can you now answer the following questions:
- How many screws of each type are in the dataset?
- How many screws of each type have diameters equal to 18mm?

**Tip**: filtering rows using boolean operations `df[df['columnName'] == value]`

### ££ 1.1.1 Logical operations
Logical operations allow you to combine multiple conditions for selecting or filtering data.  
With Pandas dataframes, and Numpy arrays in general, the Python keywords `and`, `or`, `not` do not work for combining conditions.
- use `&`, `|`, `~` instead
- put each condition in parentheses, e.g. `df[(df['type']=='Metric screw') & (df['diameter'] == 18)]`
- alternatives to `&`, `|`, `~` are `numpy.logical_and()`, `numpy.logical_or()`, `numpy.logical_not()`

These logical operators work element-wise, i.e. they compare the conditions row by row.
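The rules above can be sketched on a small made-up dataframe. The values and the 'Wood screw' type are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the workshop dataset
df = pd.DataFrame({
    "type": ["Metric screw", "Metric screw", "Wood screw", "Wood screw"],
    "diameter": [18, 12, 18, 6],
})

# Each condition is a boolean Series with one True/False per row
is_metric = df["type"] == "Metric screw"
is_18mm = df["diameter"] == 18

both = df[is_metric & is_18mm]      # AND: metric screws with 18 mm diameter
either = df[is_metric | is_18mm]    # OR: metric screws, or any 18 mm screw
not_metric = df[~is_metric]         # NOT: everything except metric screws

# numpy alternative to `&`
both_np = df[np.logical_and(is_metric, is_18mm)]

# The Python keyword `and` tries to reduce a whole Series to one True/False,
# which pandas refuses to do
try:
    df[is_metric and is_18mm]
except ValueError as err:
    print("ValueError:", err)
```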

Answering the second question is a great opportunity to utilise all the contents from week 1!

### 1.2 Groupby

Looking at the dataset by screw type, or more generally by the different values in a column, is a common operation.  
Pandas has a dedicated method for this: `dataframe.groupby(by, as_index=True)`
- `by` is the column name (or list of column names) to group on
- `as_index=True` (the default) means the group key is set as the index of the returned result
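A short sketch of what `as_index` changes, using made-up data (the values and the 'Wood screw' type are assumptions):

```python
import pandas as pd

# Hypothetical toy data, not the workshop dataset
df = pd.DataFrame({
    "type": ["Metric screw", "Metric screw", "Wood screw", "Wood screw"],
    "diameter": [18.0, 12.0, 18.0, 6.0],
})

# as_index=True: the group key 'type' becomes the index of the result
indexed = df.groupby("type", as_index=True)["diameter"].mean()
print(indexed)

# as_index=False: 'type' stays an ordinary column and the index is 0, 1, ...
flat = df.groupby("type", as_index=False)["diameter"].mean()
print(flat)
```

The flat form is often handier when the result is passed on to a plotting function that expects column names.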

<br>

Useful methods associated with `groupby()`:
- `value_counts()`: count how often each distinct value occurs
- `sum()`: sum up values (for boolean values this counts the `True` values, because `False` is 0 and `True` is 1)
- `max()`: compute the maximum of the values in each group

£ Note: the `max()` method returns the maximum of each column separately within each group, so the returned values may not correspond to any single original row.

`dataframe.groupby( )` can take in multiple group keys
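The sketch below tries these methods, and grouping by two keys, on a made-up dataframe. The `length` column, the helper column `is_18mm` and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the workshop dataset
df = pd.DataFrame({
    "type": ["Metric screw", "Metric screw", "Wood screw", "Wood screw"],
    "diameter": [18, 12, 18, 6],
    "length": [30, 50, 50, 20],
})

# value_counts(): how often each diameter occurs within each type
counts = df.groupby("type")["diameter"].value_counts()

# sum() on booleans counts the True values per group
df["is_18mm"] = df["diameter"] == 18
n_18mm = df.groupby("type")["is_18mm"].sum()

# max() works column by column: for 'Metric screw' it returns
# diameter 18 and length 50, although no single row has both values
col_max = df.groupby("type")[["diameter", "length"]].max()

# Multiple group keys are passed as a list; the result has a two-level index
by_two = df.groupby(["type", "diameter"])["length"].mean()

print(counts, n_18mm, col_max, by_two, sep="\n\n")
```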

### 1.3 Univariate analysis

£ `%matplotlib inline` tells Jupyter Notebook to display plots directly in the notebook. If this magic command is not executed, or is not available in your Python development environment, then `plt.show( )` is needed every time you want to display a plot.

#### 1.3.1 Histograms

Plot the distribution of diameters:
- Method 1: directly use matplotlib
- Method 2: use seaborn

Seaborn is a statistical data visualization library built on top of Matplotlib. While Matplotlib provides a basic set of plotting functionalities, Seaborn enhances and simplifies the creation of complex statistical visualizations by providing a high-level interface.

#### 1.3.2 * Density plot
Plot distributions using kernel density estimation. KDE represents the data using a continuous probability density curve.

#### 1.3.3 barplot( )

A bar plot shows:
- a statistical estimate of a numeric variable, e.g. the mean, as the height of each rectangle
- an indication of the uncertainty around that estimate, e.g. the standard deviation, as an error bar
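A minimal sketch with a single variable, on made-up values (an assumption). With seaborn's default settings the bar height is the mean and the error bar is a bootstrapped confidence interval rather than the standard deviation.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the workshop dataset
df = pd.DataFrame({"diameter": [18.0, 12.0, 18.0, 6.0]})

fig, ax = plt.subplots()

# One bar: height = mean of 'diameter', error bar = uncertainty of that mean
sns.barplot(data=df, y="diameter", ax=ax)

# Read the bar height back from the plot to confirm it is the mean
bar_height = ax.patches[0].get_height()
print(bar_height, df["diameter"].mean())

# Outside Jupyter (or without %matplotlib inline), call plt.show() here
```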

#### 1.3.4 Customise plotting
- Modify the size of the plot
- Add a title for the plot
- Add titles for axes
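The three customisations can be sketched with matplotlib as follows. The data values, figure size and label texts are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical diameters in mm
diameters = [6, 8, 12, 12, 18, 18, 18, 20]

# Modify the size of the plot: figsize is (width, height) in inches
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(diameters, bins=5)

# Add a title for the plot
ax.set_title("Distribution of screw diameters")

# Add titles for the axes
ax.set_xlabel("Diameter (mm)")
ax.set_ylabel("Count")

# Outside Jupyter (or without %matplotlib inline), call plt.show() here
```

Seaborn plotting functions return (or accept) a matplotlib `Axes`, so the same `set_title()`, `set_xlabel()` and `set_ylabel()` calls work on seaborn plots too.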

### 1.4 Bivariate analysis
£££ Note: Bivariate analysis looks at two variables simultaneously, within a single analysis. Placing two univariate analysis plots side by side is not bivariate analysis.

#### 1.4.1 barplot( ) for bivariate analysis
For any bivariate analysis, two variables should be passed to the analysis method. A common pattern in seaborn is `x=df[column1], y=df[column2]`.
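A small sketch of this pattern with `barplot()`, on made-up data (the values and the 'Wood screw' type are assumptions). One bar is drawn per type, with the mean diameter as its height.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the workshop dataset
df = pd.DataFrame({
    "type": ["Metric screw", "Metric screw", "Wood screw", "Wood screw"],
    "diameter": [18.0, 12.0, 18.0, 6.0],
})

fig, ax = plt.subplots()

# Two variables: a categorical one on x and a numeric one on y
sns.barplot(x=df["type"], y=df["diameter"], ax=ax)

# Bar heights read back from the plot, to compare with the group means
heights = sorted(p.get_height() for p in ax.patches)
print(heights)
print(df.groupby("type")["diameter"].mean())

# Outside Jupyter (or without %matplotlib inline), call plt.show() here
```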

#### 1.4.2 `scatterplot( )`

£ Note: the `hue=df['type']` argument affects the display in a way similar to `groupby(df['type'])`: it shows the data separately for each type, using one colour per type.
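A minimal sketch of `hue` in use, on made-up data. The `length` column, the 'Wood screw' type and all values are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not the workshop dataset
df = pd.DataFrame({
    "type": ["Metric screw", "Metric screw", "Wood screw", "Wood screw"],
    "diameter": [18, 12, 18, 6],
    "length": [40, 30, 50, 20],
})

# Each point is one screw; the colour encodes the screw type
ax = sns.scatterplot(x=df["diameter"], y=df["length"], hue=df["type"])

# Outside Jupyter (or without %matplotlib inline), call plt.show() here
```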

#### 1.4.3 `regplot( )`
Plot the scatter points together with a fitted linear regression model.
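A sketch on synthetic data, where `length` is generated as roughly 2.5 times `diameter` plus noise (an assumption made purely for illustration). `regplot()` only draws the fitted line, so `np.polyfit()` is used here as a separate way to obtain the slope and intercept as numbers.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical synthetic data with a known linear trend
rng = np.random.default_rng(0)
diameter = rng.uniform(4, 20, size=50)
length = 2.5 * diameter + rng.normal(0, 1, size=50)
df = pd.DataFrame({"diameter": diameter, "length": length})

# Scatter points plus a fitted straight line with a shaded confidence band
ax = sns.regplot(data=df, x="diameter", y="length")

# Separate least-squares fit of a degree-1 polynomial: length = slope * diameter + intercept
slope, intercept = np.polyfit(df["diameter"], df["length"], 1)
print(slope, intercept)

# Outside Jupyter (or without %matplotlib inline), call plt.show() here
```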

#### 1.4.4 £££ Correlation between two quantities

$\rho_{X,Y}=\mathrm{corr}(X,Y)=\frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y}=\frac{\mathrm{E}[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$  


$X, Y$ : random variables, i.e. different columns of interest  
$\mathrm{corr}$ : correlation  
$\mathrm{cov}$ : covariance  
$\mu_X, \mu_Y$ : means of the random variables  
$\sigma_X, \sigma_Y$ : standard deviations of the random variables  
$\mathrm{E}$ : expectation
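To make the formula concrete, the sketch below evaluates it term by term with numpy on made-up numbers (an assumption) and compares the result with the built-in pandas and numpy functions.

```python
import numpy as np
import pandas as pd

# Hypothetical values for two columns of interest
x = np.array([6.0, 8.0, 12.0, 18.0, 20.0])
y = np.array([20.0, 25.0, 40.0, 50.0, 65.0])

# cov(X, Y) = E[(X - mu_X)(Y - mu_Y)]
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# rho = cov(X, Y) / (sigma_X * sigma_Y); np.std uses the population formula by default
rho = cov_xy / (x.std() * y.std())

# Built-in equivalents
rho_pd = pd.Series(x).corr(pd.Series(y))
rho_np = np.corrcoef(x, y)[0, 1]

print(rho, rho_pd, rho_np)
```

The built-in functions use the sample versions of covariance and standard deviation, but the normalisation factors cancel in the ratio, so all three values agree.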

To understand the correlation coefficient, there are two parts:
- the sign of the coefficient
- the (absolute) value of the coefficient

##### 1.4.4.1 The sign of correlation

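Covariance determines the sign of the correlation coefficient, because the standard deviations in the denominator are always positive. Covariance measures the degree to which two variables tend to increase or decrease together: it is positive when they move in the same direction and negative when they move in opposite directions.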
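A quick numerical check of the sign, using made-up values (an assumption): one variable that rises with `x` and one that falls as `x` rises.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = 2 * x + 1        # increases when x increases
y_down = -3 * x + 10    # decreases when x increases

# np.cov returns the 2x2 covariance matrix; [0, 1] is cov(x, y)
cov_up = np.cov(x, y_up)[0, 1]
cov_down = np.cov(x, y_down)[0, 1]

# The correlation coefficient has the same sign as the covariance
r_up = np.corrcoef(x, y_up)[0, 1]
r_down = np.corrcoef(x, y_down)[0, 1]

print(cov_up, r_up)
print(cov_down, r_down)
```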


##### 1.4.4.2 The value of correlation
The correlation coefficient ranges from -1 to 1.

The (absolute) value of the coefficient:
- Describes how closely the points follow a straight line, i.e. the strength of the linear relationship
- Does not describe the slope of that line
- [Visualisations of different coefficient values](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient#/media/File:Correlation_examples2.svg)
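The sketch below illustrates the point about slope with made-up values (an assumption): a shallow line and a steep line give the same coefficient, while adding noise lowers it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10.0)

shallow = 0.5 * x                     # perfect line, small slope
steep = 5.0 * x                       # perfect line, large slope
noisy = x + rng.normal(0, 5, size=10) # same trend as x, but with scatter

r_shallow = np.corrcoef(x, shallow)[0, 1]
r_steep = np.corrcoef(x, steep)[0, 1]
r_noisy = np.corrcoef(x, noisy)[0, 1]

# Both perfect lines give a coefficient of 1 (up to rounding), whatever the slope
print(r_shallow, r_steep, r_noisy)
```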

### 1.5 Multivariate analysis (correlation focused)
The methods introduced in this section are more like 'bivariate analysis in batch': when talking about correlations between more than two variables, what actually happens is that we check all pairwise correlations in the correlation matrix.  
But there certainly are analysis methods that take multiple variables as inputs simultaneously, e.g. PCA (principal component analysis), clustering, multivariate regression, etc. (These methods are covered in the coming weeks.)
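A minimal sketch of such a correlation matrix using pandas. The data is synthetic: the `length` and `weight` columns and all values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: length depends on diameter, weight does not
rng = np.random.default_rng(0)
diameter = rng.uniform(4, 20, size=50)
df = pd.DataFrame({
    "type": ["Metric screw"] * 25 + ["Wood screw"] * 25,
    "diameter": diameter,
    "length": 2.5 * diameter + rng.normal(0, 1, size=50),
    "weight": rng.normal(10, 2, size=50),
})

# Keep the numeric columns only, then compute all pairwise correlations
corr = df.select_dtypes(include="number").corr()
print(corr)
```

Each cell is the coefficient for one pair of columns, so the matrix is symmetric and has 1 on the diagonal.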


## 2. Time series dataset: exports

££ Transposing time series data (i.e. making the timestamps the row index) is common practice. Reasons for transposing:
- Many time series analysis and plotting functions are designed to work efficiently with timestamps as the row index.
- Time-based indexing makes it more intuitive to select and filter data. (Especially useful when working with large time series datasets.)
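A small sketch of the idea, assuming a wide table with one row per product and one column per month. The layout, the product names and the values are all hypothetical; the actual exports dataset may be organised differently.

```python
import pandas as pd

# Hypothetical wide table: rows are products, columns are months
wide = pd.DataFrame(
    {"2020-01": [10, 5], "2020-02": [12, 6], "2021-01": [15, 7]},
    index=["Product A", "Product B"],
)

# Transpose so that each row is one point in time
ts = wide.T

# Convert the text labels into real timestamps
ts.index = pd.to_datetime(ts.index, format="%Y-%m")

# Time-based selection: all rows from the year 2020
ts_2020 = ts.loc["2020"]

print(ts)
print(ts_2020)
```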

### 2.1 Univariate analysis

#### 2.1.1 Lineplot

#### 2.1.2 Heatmap

### 2.2 Multivariate analysis