# Data Science for Manufacturing - Workshop 3-1: Data carpentry using Pandas

## Objectives

- Learn and use Pandas with a practical example
  - Investigate the dataset
  - Cleanse the dataset
    - Duplicates
    - Badly formed entries
    - Wrong data types
    - Missing values
  - Dataset analysis
    - Statistical distributions
    - Outliers


## 1. Introduce the dataset

![bolt dimensions from https://www.mudgefasteners.com/news/2020/12/30/how-to-measure-the-size-of-a-bolt](https://images.squarespace-cdn.com/content/v1/57fd5aa69f745699d45f362d/1609373976603-1EZYI2MGM5Z1V7I9KUQC/bolt-1.png?format=500w)

What we know about this dataset:
- Sizes should be in mm
- All values apart from IDs are floats or integers
- All IDs start with B followed by a string of integers

<br>

The goals of the analysis:  
- Find out the distributions of screws in terms of different metrics
- Identify outliers

### 1.1 Loading tabular data using pandas

To begin processing data, we need to load it into Python. We can do that using the library pandas.
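
A minimal sketch of loading tabular data with `pd.read_csv`. The CSV text, its values and the file name mentioned in the comment are made up for illustration; only the measurement column names come from this workshop.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk. In practice you would pass
# a file path (for example "bolts_group1.csv", an assumed name) instead.
csv_text = """id,thread_length,thread_pitch,head_length,diameter
B001,50.0,1.5,8.0,12.0
B002,45.5,1.25,7.5,10.0
B003,60.0,1.5,8.0,12.0
"""

bolts = pd.read_csv(io.StringIO(csv_text))
print(bolts)
```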

### 1.2 Investigating the data

Print out a subset of each dataset

Find out the data types and shapes

Check that the column names are the same
- Observing is one way, but not the best way

Observations after investigating the datasets briefly:
- Some columns have the data type 'object' although the correct data type is numeric.
- There are some null values.
- The three datasets record the same measurements for three groups of screws, so they can be handled together.
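
The investigation steps could look roughly like the sketch below. The two small dataframes `df_a` and `df_b` are made-up stand-ins for the workshop datasets, so the printed results will differ from the real ones.

```python
import pandas as pd

# Illustrative data only: the values are invented for this sketch.
df_a = pd.DataFrame({"id": ["B001", "B002"], "thread_length": [50.0, 45.5], "diameter": ["12", "10mm"]})
df_b = pd.DataFrame({"id": ["B101", "B102"], "thread_length": [60.0, None], "diameter": ["8", "12"]})

print(df_a.head())    # first rows (up to 5)
print(df_a.dtypes)    # data type of each column
print(df_a.shape)     # (number of rows, number of columns)

# Compare column names programmatically instead of by eye
same_columns = list(df_a.columns) == list(df_b.columns)
print(same_columns)

# Count missing values per column
print(df_b.isnull().sum())
```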

### 1.3 Combine DataFrames in pandas

There are different ways of combining dataframes:

- `merge( )` for combining data on common columns or indices
- `join( )` for combining data on a key column or an index
- `concat( )` for combining DataFrames across rows or columns

Note: There is no need to memorise specific functions. Instead, understand that there are different ways to combine data and that pandas provides them.

[More on merge, join and concat](https://realpython.com/pandas-merge-join-and-concat/)

Here, we need the `concat( )` function.   
The complete signature:  
`pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)`
- `objs` is the collection of objects to be concatenated, Series or DataFrames
- multiple `objs` are passed as a list with `[ ]`
- `*` indicates that all arguments following it must be specified as keyword arguments
- `axis=0` means that, by default, the concatenation is performed along axis 0; this can be changed by specifying `axis`
- to combine dataframes with overlapping columns and keep only the shared columns, pass `'inner'` to the `join` keyword argument
- for the remaining arguments, see the [pandas documentation for `concat`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas.concat)
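
As a rough illustration of the `join` argument, the sketch below concatenates two made-up dataframes. The `supplier` column is hypothetical and exists only to give the inputs different columns.

```python
import pandas as pd

# Made-up inputs; group_2 has one extra (hypothetical) column
group_1 = pd.DataFrame({"id": ["B001", "B002"], "diameter": [12.0, 10.0]})
group_2 = pd.DataFrame({"id": ["B101"], "diameter": [8.0], "supplier": ["X"]})

# Default join='outer': keep every column, fill the gaps with NaN
outer = pd.concat([group_1, group_2], ignore_index=True)

# join='inner': keep only the columns shared by all inputs
inner = pd.concat([group_1, group_2], join="inner", ignore_index=True)

print(outer.columns.tolist())
print(inner.columns.tolist())
```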

£ Axis?
- `axis = 0` means rows, `axis = 1` means columns
- Operations are performed along the axis
- A more intuitive translation: ask what changes after the operation, the rows or the columns. The concatenations above provide good examples.
- [Resource for axis usage](https://railsware.com/blog/python-for-machine-learning-pandas-axis-explained/)
- In the example of `concat()` here, `axis=0` by default, so the result of `concat()` has more rows than any of the inputs. Inputs of sizes (40, 6), (23, 6), (83, 6) give a result of size (40 + 23 + 83, 6) = (146, 6)
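
The shape arithmetic can be checked with placeholder dataframes of the same sizes. This is only a sketch: the zero-filled data and the column names `a` to `f` are assumptions, not the workshop datasets.

```python
import numpy as np
import pandas as pd

columns = ["a", "b", "c", "d", "e", "f"]  # placeholder column names
parts = [pd.DataFrame(np.zeros((n, 6)), columns=columns) for n in (40, 23, 83)]

stacked = pd.concat(parts, axis=0)   # rows are appended
print(stacked.shape)                 # (146, 6)

# Along axis=1 the column count grows instead
side_by_side = pd.concat([parts[0], parts[0]], axis=1)
print(side_by_side.shape)            # (40, 12)
```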

???: Check that all three datasets are actually concatenated
? How can you check using the `iloc` indexer? Which index ranges do the three datasets occupy in the concatenated dataframe?

??? Exercise: how to concatenate columns?
- Select the target columns from different dataframes, and put them into a `[ ]`
- Set the concat axis to 1

### 1.4 Rename columns and set a column as row index labels
Practice with the content so far, using:
- `rename()` function
- `set_index()` function
- `inplace` argument
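
A small sketch of the three items above. The original column names `ID` and `Thread Length` are hypothetical; the real datasets may use different ones.

```python
import pandas as pd

# Hypothetical starting column names
df = pd.DataFrame({"ID": ["B001", "B002"], "Thread Length": [50.0, 45.5]})

# rename() returns a new dataframe unless inplace=True is passed
df = df.rename(columns={"ID": "id", "Thread Length": "thread_length"})

# inplace=True modifies df directly and returns None
df.set_index("id", inplace=True)

print(df)
```

Both styles work; reassigning the result (`df = df.rename(...)`) makes it more visible which object has changed.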

### 1.5 ££ Selection by boolean operation

When dealing with dataframes, a common task is to select rows with a column's value satisfying some criteria. For example, we want to select all screws with thread length bigger than 50.

To select rows based on boolean conditions:
- create a Series of boolean results, one per row (`row_conditions`). These boolean values mark whether each row satisfies the criteria.
- pass the Series of boolean results to the dataframe in the form `dataframe[row_conditions]`
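
The two steps can be sketched as follows for the "thread length greater than 50" example, using made-up values.

```python
import pandas as pd

# Made-up values for illustration
df = pd.DataFrame(
    {"thread_length": [50.0, 45.5, 60.0, 72.0]},
    index=["B001", "B002", "B003", "B004"],
)

row_conditions = df["thread_length"] > 50   # boolean Series, one value per row
long_bolts = df[row_conditions]             # keep rows where the value is True

print(row_conditions.tolist())   # [False, False, True, True]
print(long_bolts)
```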

## 2. Find and deal with duplicate values

### 2.1 Let's deal with duplicate IDs first

(An example of boolean operations)

The task to remove duplicates from a column:
- Find out how many unique indices there are
- Select rows with unique indices and remove duplicates

Relevant methods:
- `unique()`: return unique values
- `duplicated(keep = 'first')`: return boolean values denoting duplicate rows
    - `keep = 'first'` : Mark duplicates as True except for the first occurrence.

    - `keep = 'last'` : Mark duplicates as True except for the last occurrence.

    - `keep = False` : Mark all duplicates as True.
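
A possible way to apply these methods to duplicate IDs, assuming the IDs have been set as the row index. The data is invented for this sketch.

```python
import pandas as pd

# Made-up data with one repeated ID
df = pd.DataFrame(
    {"diameter": [12.0, 10.0, 12.0, 8.0]},
    index=["B001", "B002", "B001", "B003"],
)

print(len(df.index.unique()))            # 3 unique IDs out of 4 rows

is_duplicate = df.index.duplicated(keep="first")
print(is_duplicate.tolist())             # [False, False, True, False]

deduplicated = df[~is_duplicate]         # ~ inverts the boolean mask
print(deduplicated)
```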

£££ Column selection and row selection may return a view of the original data rather than an independent copy.
- In pandas, some operations return a view (or shallow copy) that shares data with the original, while others return a deep copy. This matters because dataframes are mutable: changing shared data changes the original too. The difference is similar to that between
  - `list_copy1 = list` (a second name for the same list)
  - `list_copy2 = list[:]` (a new list)
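
The list analogy can be tried out directly. The second half of the sketch shows `DataFrame.copy()`, which gives an independent dataframe; the variable names are chosen for illustration.

```python
import pandas as pd

original = [1, 2, 3]
alias = original        # same object: changes show up under both names
copied = original[:]    # new list: changes stay local

alias[0] = 99
copied[1] = -1
print(original)         # [99, 2, 3]

# For dataframes, copy() gives an independent object
df = pd.DataFrame({"diameter": [12.0, 10.0]})
independent = df.copy()
independent.loc[0, "diameter"] = 0.0
print(df.loc[0, "diameter"])   # 12.0
```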

### 2.2 Next deal with duplicate rows
Relevant method:
- `duplicated()`
- `drop_duplicates(keep = 'first', inplace = False)`
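
A short sketch of both methods on made-up data in which the third row repeats the first.

```python
import pandas as pd

# Made-up data: row 2 is an exact copy of row 0
df = pd.DataFrame(
    {"thread_length": [50.0, 45.5, 50.0], "diameter": [12.0, 10.0, 12.0]}
)

print(df.duplicated().tolist())      # [False, False, True]
print(df.duplicated().sum())         # number of duplicate rows

unique_rows = df.drop_duplicates(keep="first")   # returns a new dataframe
print(unique_rows.shape)             # (2, 2)
```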

? What does 12 mean out of 23?

£££ A warning (typically `SettingWithCopyWarning`) will show up if you try to modify what may be a shallow copy/view of the original data. This is dangerous, and the warning is raised because:
- What you might want is to change only the selected subset
- But what may actually happen is that the original data changes as well

## 3. Let's deal with badly formed IDs
- Find out rows with bad indices
- Modify the bad indices

Observe that there are indices that start with 'b', while the majority start with 'B'.

### 3.1 `map( )` and `lambda`

`map( )` in pandas: `dataframe.map(function)`
- apply the function to every element of the dataframe
- [Pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.map.html#pandas.DataFrame.map)



** Python's built-in `map( )`: `map(function, iterable)`
- iterable: simple examples are sequence types such as lists
- applies the function to every element of the iterable
- [Python documentation](https://docs.python.org/3/library/functions.html#map)



** `lambda arguments : expression`
- defines a function in one line. It is used where the function is short and does not need a `def`

![lambda function](https://www.softwaretestinghelp.com/wp-content/qa/uploads/2021/02/fig1_lambda-expression.jpg)

- the arguments declare the variables
- the expression defines an operation on the variables
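
The pieces above can be combined in a few lines. This sketch uses `Series.map`, which applies a function to each element of a single column, rather than `DataFrame.map`; the IDs are made up.

```python
import pandas as pd

square = lambda x: x * x           # same as: def square(x): return x * x
print(square(4))                   # 16

# Built-in map returns a lazy iterator, so wrap it in list() to see the values
print(list(map(square, [1, 2, 3])))    # [1, 4, 9]

# Series.map applies a function to each element of one column
ids = pd.Series(["B001", "b002", "B003"])
print(ids.map(lambda s: s.upper()).tolist())   # ['B001', 'B002', 'B003']
```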

### 3.2 Method 1: step-by-step method

- Create a function that checks whether an ID starts with 'B'; applying it to every ID gives a list of boolean values, which serves as the row conditions.
- Modify accordingly

Test the function with a random input

Add this created list of boolean values as a column to the dataframe

Locate the rows with bad IDs

Verify that there are no IDs that do not start with 'B'
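
One way the step-by-step method might look. This is a sketch under assumptions: the IDs are kept in an ordinary column called `id`, the helper name `starts_with_B` and the column name `good_id` are invented, and the data is made up.

```python
import pandas as pd

def starts_with_B(bolt_id):
    """Return True if the ID starts with an upper-case 'B'."""
    return bolt_id.startswith("B")

# Test the function with a single input
print(starts_with_B("b042"))   # False

df = pd.DataFrame({"id": ["B001", "b002", "B003"], "diameter": [12.0, 10.0, 8.0]})

# Add the boolean results as a column, then locate the bad rows
df["good_id"] = df["id"].map(starts_with_B)
bad_rows = df[~df["good_id"]]
print(bad_rows)

# Fix the bad IDs, recompute the check and verify
df.loc[~df["good_id"], "id"] = df.loc[~df["good_id"], "id"].str.capitalize()
df["good_id"] = df["id"].map(starts_with_B)
print(df["good_id"].all())     # True
```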

### * 3.3 Method 2: Nested Pythonic method
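
The workshop's own solution is not reproduced here. As one possible compact version, the check and the fix can be nested into a single `map` call with a `lambda`; the column name `id` and the data are assumptions for this sketch.

```python
import pandas as pd

# Made-up data with one badly formed ID
df = pd.DataFrame({"id": ["B001", "b002", "B003"], "diameter": [12.0, 10.0, 8.0]})

# One nested expression: fix an ID only if it does not already start with 'B'
df["id"] = df["id"].map(lambda s: s if s.startswith("B") else s.capitalize())

print(df["id"].tolist())   # ['B001', 'B002', 'B003']
```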

## 4. Correct data types

- `dataframe.dtypes` (an attribute, no parentheses) shows the data type of each column
- `dataframe.astype()` changes data types to the target types
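
A minimal sketch of checking and changing data types, using made-up values in which one numeric column has been stored as strings.

```python
import pandas as pd

# Made-up data: 'diameter' holds numbers written as strings
df = pd.DataFrame({"diameter": ["12", "10", "8"], "thread_pitch": [1.5, 1.25, 1.0]})

print(df.dtypes)                                  # dtypes is an attribute

df["diameter"] = df["diameter"].astype(float)     # astype returns a converted copy
print(df.dtypes)
```

`astype` raises an error if any value cannot be converted, which is why the steps below use `pd.to_numeric` for messy columns.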

### 4.1 Fix datatype for one column

Step 1: Observe the data types of the columns to check whether there are any problems

We know that 'thread_length', 'thread_pitch','head_length', and 'diameter' are supposed to be numeric types.  

Step 2: Find out which data types occur in a column, and decide what to do

Step 3: The rows with 'float' type are fine; the rows with 'str' type need their data types corrected:
- find out what the string values look like

- use the function `pd.to_numeric(obj, errors = 'raise')` to convert `obj` to a numeric type
    - `obj` is the object to convert, e.g. a column of the dataframe
    - `errors` specifies what to do with invalid cases
    - `errors = 'raise'`: raise an exception
    - `errors = 'ignore'`: return the input unchanged (deprecated since pandas 2.2)
    - `errors = 'coerce'`: set invalid cases to NaN


Step 4: Check for invalid cases generated by `errors = 'coerce'` and fix them
- use the method `isnull()` to find out rows containing null values
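
Steps 2 to 4 could be sketched as below. The column contents, including the invalid entry `'60mm'`, are made up to show how `errors='coerce'` behaves.

```python
import pandas as pd

# Made-up mixed column: floats and strings together give dtype 'object'
df = pd.DataFrame({"thread_length": [50.0, "45.5", "60mm", 72.0]})

# Step 2: which Python types occur in the column?
print(df["thread_length"].map(type).value_counts())

# Step 3: convert, turning anything unparseable into NaN
converted = pd.to_numeric(df["thread_length"], errors="coerce")

# Step 4: find the rows that became NaN and inspect the original values
invalid = converted.isnull()
print(df.loc[invalid, "thread_length"].tolist())   # ['60mm']
```

Looking at the original values of the invalid rows shows what needs fixing by hand, here a stray unit suffix.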

Step 5: Wrap up

Check whether the data types are floats; if not, change the type to float

### ??? 4.2 Exercise: fix datatype for 'head_length' column

## 5. Dealing with missing values
Common methods:
- replace using the mean
- replace using the most frequent value
- drop the rows with missing values

Relevant functions:
- `isnull( )` and `isna( )` are equivalent
- `dataframe.isnull( )` and `dataframe.isna( )` detect missing values and return a boolean mask with one value for each element in the dataframe

- `dataframe.any(*, axis=0)` returns whether any element is True, over the defined axis
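
A small sketch combining `isnull()` and `any()` on made-up data with a single missing value.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing thread_length
df = pd.DataFrame({"thread_length": [50.0, np.nan, 60.0], "diameter": [12.0, 10.0, 8.0]})

mask = df.isnull()            # same shape as df, True where a value is missing
print(mask)

print(mask.any(axis=0))       # one result per column
print(mask.any(axis=1))       # one result per row
```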


???: Why does `axis = 0` give results per column here?

### 5.1 Replace using mean values for one column
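
A minimal sketch of mean imputation with `fillna`, assuming `thread_length` is the column being filled; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
df = pd.DataFrame({"thread_length": [50.0, np.nan, 60.0, 40.0]})

mean_length = df["thread_length"].mean()          # NaN values are skipped by default
df["thread_length"] = df["thread_length"].fillna(mean_length)

print(df["thread_length"].tolist())   # [50.0, 50.0, 60.0, 40.0]
```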

### 5.2 Replace using frequencies for other columns
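
Replacing with the most frequent value might look like this sketch, assuming `thread_pitch` is one of the columns treated this way; the values are made up.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value
df = pd.DataFrame({"thread_pitch": [1.5, 1.5, np.nan, 1.25]})

# mode() returns a Series because several values can tie; take the first
most_frequent = df["thread_pitch"].mode().iloc[0]
df["thread_pitch"] = df["thread_pitch"].fillna(most_frequent)

print(df["thread_pitch"].tolist())   # [1.5, 1.5, 1.5, 1.25]
```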

### 5.3 Drop data with missing values
If the strategy for rows with missing data is to drop them, there are two options:
- `dataframe.drop(index=, inplace = True)`: drop named rows (can also be used to drop columns)
- `dataframe.dropna(axis = 0, how ='any', subset = None, inplace = True)`: drop rows in which any/all values are NaN
  - `axis` specifies whether to drop rows or columns
  - `how` specifies whether the condition is 'any value is NaN' or 'all values are NaN'
  - `subset` specifies which columns to check, for example, `subset = ['thread_length', 'diameter']`
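
The options above can be compared on a small made-up dataframe. None of the calls uses `inplace=True` here, so `df` itself stays unchanged and each result can be inspected separately.

```python
import numpy as np
import pandas as pd

# Made-up data: B002 and B003 each have one missing value
df = pd.DataFrame(
    {
        "thread_length": [50.0, np.nan, 60.0],
        "diameter": [12.0, 10.0, np.nan],
        "head_length": [8.0, 7.5, 8.0],
    },
    index=["B001", "B002", "B003"],
)

print(df.dropna().index.tolist())                          # ['B001']
print(df.dropna(subset=["thread_length"]).index.tolist())  # ['B001', 'B003']
print(df.dropna(how="all").index.tolist())                 # no row is entirely NaN
print(df.drop(index=["B002"]).index.tolist())              # ['B001', 'B003']
```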


## 6. Identify outliers

- Statistical methods:
  - Simple sorting
  - Z-scores
  - IQR
- Visualisation methods:
  - Box plot
  - Histogram plot

### 6.1 Simple sorting
Sort the values and cut out the extremes
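
A rough sketch of the idea with `sort_values`. The data is made up, and cutting exactly one row from each end is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up diameters with one suspiciously large and one suspiciously small value
df = pd.DataFrame(
    {"diameter": [12.0, 10.0, 120.0, 8.0, 0.1, 14.0]},
    index=["B001", "B002", "B003", "B004", "B005", "B006"],
)

ordered = df.sort_values("diameter")
print(ordered)                     # extremes sit at the top and bottom

# Cut one row from each end; how many to cut is a judgement call
trimmed = ordered.iloc[1:-1]
print(trimmed.index.tolist())      # ['B004', 'B002', 'B001', 'B006']
```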

### 6.2 IQR (interquartile range)

- Normal distribution  
![normal distribution](https://www.mathsisfun.com/data/images/normal-distrubution-large.svg)



![IQR explanation based on normal distribution](https://www.researchgate.net/publication/362595959/figure/fig3/AS:11431281078749803@1660236160899/Interquartile-range-and-empirical-rule-of-a-normal-distribution.png)

- The IQR boxplot  
    - Box: The box in a boxplot represents the interquartile range (IQR), from the first quartile (Q1) to the third quartile (Q3). It contains the middle 50% of the data.
    - Whiskers: The lines extending from the box show the range of the data within a certain distance of the quartiles. They typically extend up to 1.5 times the IQR beyond Q1 and Q3.
    - Individual data points beyond the whiskers are considered outliers.
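
The same rule can be applied numerically. This sketch uses the common 1.5 × IQR fences on made-up diameters; the factor 1.5 is a convention, not a fixed requirement.

```python
import pandas as pd

# Made-up diameters with one clear outlier
diameters = pd.Series([10.0, 11.0, 12.0, 12.0, 13.0, 14.0, 40.0])

q1 = diameters.quantile(0.25)
q3 = diameters.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = diameters[(diameters < lower) | (diameters > upper)]
print(outliers.tolist())   # [40.0]
```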

### 6.3 Z-score
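- The z-score measures how many standard deviations a data point lies from the mean. Data points outside the range that covers most of the data (e.g. 99.7%, which for a normal distribution means more than 3 standard deviations from the mean) are considered outliers.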
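
A minimal sketch of z-score outlier detection with a threshold of 3, which corresponds to the 99.7% range of a normal distribution. The values are made up, and real measurements may not be normally distributed.

```python
import pandas as pd

# Made-up measurements: 29 ordinary values and one extreme value
values = pd.Series([10.0] * 10 + [12.0] * 10 + [11.0] * 9 + [30.0])

# z-score: distance from the mean in units of standard deviation
z = (values - values.mean()) / values.std()

outliers = values[z.abs() > 3]
print(outliers.tolist())   # [30.0]
```

With very small samples no point can reach a z-score of 3, so this method needs a reasonable number of rows to be useful.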