# Data Science for Manufacturing - Workshop 2-1: A deeper look into Python and programming, and an introduction to Pandas

##  Objectives


- 1. More Python and general programming basics:

  - lists, tuples, *dictionaries

  - £££ mutable and immutable data types

  - functions and loops

- 2. Introduction to common Python libraries

  - Pandas, Numpy, and more common Python libraries

  - Import a Python library and use the functions it contains (Obj 3)

- 3. Introduction to Pandas

  - Read in tabular data from a file

  - Investigate data

  - Indexing, slicing, selection

    - £ the `inplace` argument

  


## 1. Python and general programming essentials

### 1.1 More Python data types

#### 1.1.1 List (review of last week's content)

Lists are a common data structure to hold an ordered sequence of elements.
- Construct a list using square brackets '[ ]'
- Each element can be accessed by an index. Note that Python indices start from 0 instead of 1.
- There are methods specific to the list type, such as `append`
- A list can be modified after creation
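A minimal sketch of the four points above. The list name `parts` and its contents are made up for illustration.

```python
# Construct a list with square brackets
parts = ["bolt", "nut", "washer"]

# Access an element by index; indices start from 0
first_part = parts[0]

# Use a list method: append adds an element at the end
parts.append("screw")

# Modify an element after creation
parts[1] = "rivet"

print(first_part)
print(parts)
```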

#### 1.1.2 Tuples

A tuple is similar to a list in that it is also an ordered sequence of elements.
- Construct a tuple using '( )' brackets
- Similar to lists, elements can be accessed by indices
- Tuples also have their own methods, such as `count` and `index`
- A tuple cannot be modified after creation
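A short sketch of the same operations on a tuple, using a hypothetical tuple named `dimensions`. The last part shows that an attempt to modify a tuple raises a `TypeError`.

```python
# Construct a tuple with round brackets
dimensions = (12.5, 3.0, 3.0)

# Access an element by index, just like a list
width = dimensions[0]

# Use a tuple method: count how many times a value appears
count_of_threes = dimensions.count(3.0)

# Trying to modify a tuple fails
modification_failed = False
try:
    dimensions[0] = 10.0
except TypeError:
    modification_failed = True

print(width, count_of_threes, modification_failed)
```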

**???: Tuples are so similar to lists, why make them two different data types?**

#### 1.1.3 * Dictionaries

A dictionary is a container that holds pairs of objects, keys and values, in the format `{key: value}`
- Construct dictionaries with '{ }' curly brackets
- Each value can be accessed by its key

To add an item to the dictionary we assign a value to a new key:
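A small illustration, assuming a hypothetical dictionary `stock` that maps part names to quantities:

```python
# Construct a dictionary with curly brackets
stock = {"bolt": 120, "nut": 300}

# Access a value by its key
bolt_count = stock["bolt"]

# Add an item by assigning a value to a new key
stock["washer"] = 75

print(bolt_count)
print(stock)
```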

### 1.2 £££ Mutable and immutable types

So far, we've covered several data types: integer, float, string, list, tuple and dictionary. These data types can be grouped by mutability.  
Immutable types:
- integer
- float
- string
- tuple  


Mutable types:
- list
- dictionary

Immutable types: once an object is created, its value cannot be modified.  
Mutable types: the content of an object can be modified in place after it has been created.

#### 1.2.1 What happens when creating a variable

When creating a variable in Python, what happens is:
- an object holding the value is created, and a space in memory is allocated to it based on its type
- the variable's name is a reference to that object in memory  

We can view `id()` as a function that reports the address of an object in memory (this is what it returns in CPython, the standard Python interpreter).
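As a rough illustration, the sketch below compares the ids of three hypothetical variables. Two names that refer to the same object report the same id, while a separately created object with equal content reports a different one.

```python
a = [1, 2, 3]
b = a            # b is a second name for the same object
c = [1, 2, 3]    # c has equal content but is a separate object

print(id(a), id(b), id(c))

same_ab = id(a) == id(b)
same_ac = id(a) == id(c)
print(same_ab)
print(same_ac)
```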

#### 1.2.2 What happens when modifying a variable

When modifying variables, mutable and immutable types behave differently
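A minimal sketch of such an experiment. The starting values of `int1` and `list1` are assumptions chosen for illustration.

```python
# Immutable type: integer
int1 = 2
id_int_before = id(int1)
int1 = 3
id_int_after = id(int1)

# Mutable type: list
list1 = [1, 2, 3]
id_list_before = id(list1)
list1.append(4)
id_list_after = id(list1)

print(id_int_before == id_int_after)    # False: int1 now refers to a different object
print(id_list_before == id_list_after)  # True: list1 is still the same object
```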

Observe that, before and after modification:
- `int1` has different id values, i.e. it refers to different memory locations
- `list1` keeps the same id value  

What actually happens is:  

For `int1`:  
- `int1` is an integer, which is an immutable type. Once the integer object is created, its value is closed to changes.
- When we write `int1 = 3`, the original object is not altered. Instead, the name `int1` is rebound to a different integer object, with the value 3, at a different memory location.
- The original object is cleaned up by Python only once nothing refers to it any more.  

For `list1`:
- `list1` is a list, which is a mutable type. Its content is open to changes in place.
- Therefore, after modification, `list1` still refers to the same object, and its id stays the same.

#### 1.2.3 What happens when passing values to another variable and changing the original data

Observe:
When assigning a variable to a second name,  

for `int2` (immutable types):  
the new variable keeps the value it had when the assignment was executed, and does not change when the original variable is modified.  

for `list2` (mutable types):
the new variable changes together with the original variable.

<br>

Reasons for these different behaviours:
When writing a line of code `variable_copy = variable_origin`, no data is copied. In both cases, `variable_copy` becomes a second reference to the object that `variable_origin` refers to. The difference appears when the original is modified:

for immutable types:  
the object cannot be changed in place, so "modifying" `variable_origin` rebinds that name to a new object. `variable_copy` still refers to the old object and keeps the old value.  

for mutable types:  
the object is changed in place. Because `variable_copy` is just another reference to the same memory space, it reflects whatever is currently stored there.

<br>

If you want to copy the current values of a list, rather than create another reference to the same memory space, use `list_copy = list_origin[:]`
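The behaviour described in this section can be sketched as follows. The variable names follow the text above, while the values are assumptions.

```python
# Immutable type: the second name keeps the old value
int1 = 2
int2 = int1
int1 = 3          # int1 now refers to a new object; int2 is untouched

# Mutable type: the second name follows the original
list1 = [1, 2, 3]
list2 = list1     # another reference to the same list
list_copy = list1[:]   # a new list holding the current values
list1.append(4)   # change the original in place

print(int2)       # 2
print(list2)      # [1, 2, 3, 4]
print(list_copy)  # [1, 2, 3]
```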

#### ??? 1.2.4 What happens when passing values to another variable and changing the copied variable

### 1.3 Functions and loops

#### 1.3.1 Functions

Functions are used when a section of code needs to be repeated at various different points in a program. It saves you re-writing it all.  

In reality you rarely need to repeat the exact same code. Usually there will be some variation in variable values needed. Because of this, when you create a function you are allowed to specify a set of parameters which represent variables in the function.

In our use of the print function `print(arguments)`, we have provided whatever we want to print as an argument. Typically, each time we use the print function, we pass a different argument value.

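Defining a section of code as a function in Python is usually done using the format:  
- the definition line:
    - `def` keyword
    - `function_name()`, with any parameters listed inside the brackets
    - `:` at the end  
- the indented body, containing the code to run
- the finish line:
    - `return` to send a result back to the caller and end the function (optional; without it the function returns `None`)

For example, a function that takes two arguments and returns their sum can be defined as: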
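A minimal version of such a function. The name `add_numbers` and its parameter names are chosen here for illustration.

```python
def add_numbers(a, b):
    # The body is indented under the definition line
    total = a + b
    # return sends the result back to the caller
    return total

result = add_numbers(2, 3)
print(result)  # 5
```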

£ Note: By default, if a function returns several values separated by commas, Python implicitly packs them into a tuple.
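For instance, the hypothetical function `min_and_max` below returns two values separated by a comma. The caller receives a single tuple, which can also be unpacked into separate variables.

```python
def min_and_max(values):
    # Two comma-separated values are returned as one tuple
    return min(values), max(values)

result = min_and_max([4, 9, 1])
print(type(result))
print(result)  # (1, 9)

# The tuple can be unpacked into separate variables
lowest, highest = result
```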

#### 1.3.2 Loops

A `for` loop can be used to access the elements of a list (or any other iterable) one by one.

A `while` loop is used to execute a block of statements repeatedly as long as a given condition remains true.
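A short sketch of both kinds of loop, using made-up data:

```python
# for loop: visit each element of a list in turn
measurements = [2.5, 3.0, 2.5]
total = 0.0
for value in measurements:
    total = total + value
print(total)  # 8.0

# while loop: repeat as long as the condition is true
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown = countdown - 1
print(steps)  # [3, 2, 1]
```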

#### 1.3.3 Conditionals

In Python, use conditionals like `if` to control whether or not a block of code is executed.

The primary conditional statements are `if`, `elif`, `else`.
Note: `elseif` is not recognised by Python.

<br>

Rules for constructing conditional statements:
- First line opens with `if` and ends with a colon.
- Body containing one or more statements is indented (usually automatically formatted in any Python development environment). Indentation is very important in Python.
    

Conditionals are often used inside loops

Use `else` to execute a block of code when an if condition is not true

Use `elif` to specify additional tests

Order is important. Conditions are tested in order, and only the block belonging to the first true condition is executed.
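The sketch below combines these points: a conditional with `if`, `elif` and `else`, used inside a loop. The function `classify` and its thresholds are hypothetical. Note that a length of 5 satisfies both tests, but only the first matching block runs.

```python
def classify(length_mm):
    if length_mm < 10:
        label = "short"
    elif length_mm < 20:
        label = "medium"
    else:
        label = "long"
    return label

# Conditionals are often used inside loops
labels = []
for length in [5, 12, 30]:
    labels.append(classify(length))

print(labels)  # ['short', 'medium', 'long']
```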

### 1.4 A useful tool in notebook: bash commands

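Bash commands can be executed directly in notebook cells.
- This does not apply to other Python development environments, it's specific to notebooks.
- Start a bash command with `!`, for example `!ls`. A few common commands are also available as IPython magic commands, which start with `%`, such as `%cd` and `%pwd`.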

## 2. Libraries

[Common libraries cheatsheet](https://www.python-graph-gallery.com/cheat-sheets/)

[Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

The term "library" is used to describe a collection of code, containing functions or precompiled code, that can be reused in a program for specific, well-defined operations.

- 'Package' usually has a similar meaning when talking about Python. The two terms are often used interchangeably.

- The Python Standard Library is an extensive suite of modules that comes with Python itself.
  - Examples of libraries within the Python Standard Library are: `os`, `math`, `datetime`.
- Many additional libraries are available from PyPI (the Python Package Index).
  - They are easy to install with `pip`.

<br>

Reasons that library development is important:
- Code reusability: capture repeated patterns
- Abstraction: relieve people from knowing everything and every detail
- Modularity, Standardization, Collaboration
- Specialised functionality
  - In some job ads, specific Python libraries, such as Tensorflow and PyTorch, are named to indicate the expertise sought.
- ...
- All these features combined are one of the main reasons why Python has gained such popularity.

<span style="color:red">**A program must import a library module before using it**</span>

- £ Importing all the required libraries at the beginning of a program is usually a good idea

`pip` can be used to check all the installed libraries, with the command `pip list`
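A brief sketch of the common import styles, using only modules from the Python Standard Library. The alias `dt` is a free choice for illustration.

```python
# Import a whole library and access its functions with a dot
import math

# Import a library under a shorter alias
import datetime as dt

# Import a single function from a library
from math import sqrt

root = math.sqrt(16)
same_root = sqrt(16)
one_day = dt.timedelta(days=1)

print(root, same_root)
print(one_day.total_seconds())
```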

### 2.1 Numpy

- What it does: Provides access to N-dimensional arrays and support for performing intensive mathematical as well as scientific calculations
- Objects: arrays of any dimension
- Functionalities: comprehensive mathematical functions, linear algebra, Fourier transform, etc
- Dependencies: Python

### 2.2 Pandas

- What it does: Provides access to efficient data structures for structured and time-series data. Pandas is a widely-used Python library for statistics, particularly on tabular data.
- Objects: two-dimensional tabular data (Excel data being an example)
- Functionalities: tabular data manipulation and analysis
- Dependencies: numpy

<br>

? Think about why Python is not listed here as a Pandas dependency

#### 2.2.1 Numpy VS Pandas

Main difference: they are aimed at different types of data

![numpy-and-pandas.png](https://github.com/dsmanufacturing/dsmanufacturing.github.io/blob/master/images/Screenshot%202022-01-27%20at%2002-57-32%20Difference%20between%20Pandas%20VS%20NumPy%20-%20GeeksforGeeks.png?raw=true)   
Source: GeeksforGeeks

### 2.3 Matplotlib

- What it does: Helps developers create plots and other visualizations
- Objects: basic python data types and also numpy arrays
- Functionalities: visualisations
- Dependencies: numpy

### 2.4 Seaborn

- What it does: Seaborn is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with pandas data structures
- Objects: numpy arrays and also pandas dataframes
- Functionalities: statistical visualisations
- Dependencies: matplotlib, numpy, pandas

### 2.5 Scikit-learn

- Objects: basic python data types and also numpy arrays
- Functionalities: machine learning
- Dependencies: numpy, scipy.


Note: Scikit-learn is especially good for simple machine learning algorithms, but not as good for deep learning.

### 2.6 Tensorflow and Keras, PyTorch

All three packages are popular machine learning frameworks. In most cases, the functionalities of tensorflow and pytorch overlap. Keras is a high-level API within tensorflow. In this course, we'll use tensorflow and keras to demonstrate these frameworks.   


Tensorflow:  

- Objects: high-dimensional tensors and arrays
- Functionalities: machine learning framework for neural network definition, layer customisation, model training, etc
- Dependencies: numpy, keras

Keras:  

- Objects: high-dimensional tensors and arrays
- Functionalities: high-level deep learning framework
- Dependencies: none in particular (still a Python library)

## 3. Introduction to Pandas




### 3.1 Pandas data structures, series and dataframes

#### 3.1.1 Series

A Series is a one-dimensional labeled array.
- important attribute: index


#### 3.1.2 DataFrames

A DataFrame is a two-dimensional labeled data structure.
- important attributes: index and column names


#### 3.1.3 Relationships between series and dataframes

- Any row or column of a dataframe is a series
- A series can be used to construct a dataframe
    - When using series to create dataframes, the two attributes of dataframes (index and column names) should be specified. If not specified, they will be created automatically.

Observe: although `ss` and `df1` seem to have the same content, their dimensions are different because of the nature of series and dataframes
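A minimal sketch of this comparison. The names `ss` and `df1` follow the text, while the values and the series name `length` are assumptions.

```python
import pandas as pd

# A series: one-dimensional, with an index
ss = pd.Series([10, 20, 30], name="length")

# A dataframe built from the series: two-dimensional, with an index and column names
df1 = pd.DataFrame(ss)

print(ss.shape)    # (3,)
print(df1.shape)   # (3, 1)

# A column of a dataframe is a series
column = df1["length"]
print(type(column))
```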


### 3.2 Basic functions

#### 3.2.1 Load tabular data using pandas

To begin processing data, we need to load it into Python. We can do that using the library pandas.

- Load it with `import pandas as pd`. The alias pd is commonly used for Pandas.
- Read a Comma Separated Values (CSV) data file with `pd.read_csv`.
    - Argument is the name of the file to be read.
    - Assign result to a variable to store the data that was read.
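The steps above can be sketched as follows. So that the example runs on its own, the CSV content is an invented three-row table read from an in-memory string. With a real file you would pass its name instead, for example `pd.read_csv("my_data.csv")`, where `my_data.csv` is a placeholder.

```python
import io

import pandas as pd

# Invented CSV content standing in for a file on disk
csv_text = """part.id,thread.length,diameter
A1,12.5,3.0
A2,14.0,3.5
A3,11.8,3.0
"""

# read_csv accepts a file name or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
```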


#### 3.2.2 Investigate and inspect the data

![Dataframe schematic](https://pynative.com/wp-content/uploads/2021/02/dataframe.png)

£££ Terms for dataframes:
- Column names (labels) VS column indices
- Index labels VS row indices

#### 3.2.3 Select a column
Options:
- dot method: use column names (only works when the name is a valid Python identifier, so not for a name such as `thread.length`)
- square bracket method: use column names
- iloc method: use column indices
- loc method: use column names

Summary:
When using the `loc` or `iloc` methods, the format is as follows (taking `df.loc[:, 'thread.length']` as an example):
- square brackets []
- inside [], use `,` to separate rows and columns
- `:` means all content along this axis is selected. In the example, `:` means all rows are selected
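The four options can be compared on a small invented dataframe. The column name `thread.length` is taken from the example above, and the other names and values are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "thread.length": [12.5, 14.0, 11.8],
    "diameter": [3.0, 3.5, 3.0],
})

# Square bracket method: column name
col_bracket = df["thread.length"]

# loc method: all rows, column selected by name
col_loc = df.loc[:, "thread.length"]

# iloc method: all rows, column selected by position (0 is the first column)
col_iloc = df.iloc[:, 0]

# Dot method: column name, here on a column whose name has no dot in it
col_dot = df.diameter

print(col_loc)
```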

#### 3.2.4 Select a row
Options:
- loc method: use index labels (row names)
- iloc method: use row indices (numbering)

£££ Note: the '2' in the loc method and the '2' in the iloc method are actually different. The 2 in the loc method is an index label. It works only because the index labels happen to be the numbers from 0 to 149, which is not always the case. An example is shown below:
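A small illustration, assuming an invented dataframe whose index labels are deliberately not in the order 0, 1, 2:

```python
import pandas as pd

# Index labels 2, 0, 1 are assigned to the first, second and third row
df = pd.DataFrame({"diameter": [3.0, 3.5, 4.0]}, index=[2, 0, 1])

# loc uses the index label: the row labelled 2 is the first row
by_label = df.loc[2]

# iloc uses the position: position 2 is the third row
by_position = df.iloc[2]

print(by_label["diameter"])     # 3.0
print(by_position["diameter"])  # 4.0
```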

#### 3.2.5 Select a subset of the dataframe
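
One possible way to select a subset is to give both a row selection and a column selection to `loc` or `iloc`. The sketch below uses an invented dataframe. Note that a slice of labels in `loc` includes its end point, while a slice of positions in `iloc` does not.

```python
import pandas as pd

df = pd.DataFrame({
    "thread.length": [12.5, 14.0, 11.8, 13.2],
    "diameter": [3.0, 3.5, 3.0, 4.0],
    "pitch": [0.5, 0.6, 0.5, 0.7],
})

# loc: rows labelled 0 to 2 (inclusive), two columns chosen by name
subset_loc = df.loc[0:2, ["thread.length", "pitch"]]

# iloc: row positions 0 and 1, column positions 0 and 1
subset_iloc = df.iloc[0:2, 0:2]

print(subset_loc.shape)   # (3, 2)
print(subset_iloc.shape)  # (2, 2)
```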

#### 3.2.6 Select single entries
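
A single entry can be selected by giving one row and one column. The sketch below, on an invented dataframe, shows this with `loc` and `iloc`, plus the `at` accessor that pandas provides for single values.

```python
import pandas as pd

df = pd.DataFrame({
    "thread.length": [12.5, 14.0, 11.8],
    "diameter": [3.0, 3.5, 3.0],
})

# loc: index label and column name
value_loc = df.loc[1, "diameter"]

# iloc: row position and column position
value_iloc = df.iloc[1, 1]

# at: like loc, but only for a single value
value_at = df.at[1, "diameter"]

print(value_loc, value_iloc, value_at)
```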

#### 3.2.7 Rename columns

££ in-place: whether the original data stored in memory is changed, or a modified copy is returned instead

`axis = 1` here means the modification is applied along axis 1, which is the columns.
More details about `axis = 1` will be covered next week
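
A minimal sketch of renaming with and without `inplace`, on an invented dataframe. The new column name `thread_length` is chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"thread.length": [12.5, 14.0], "diameter": [3.0, 3.5]})

# Without inplace: a renamed copy is returned, df itself is unchanged
renamed = df.rename({"thread.length": "thread_length"}, axis=1)
columns_before = list(df.columns)

# With inplace=True: df itself is changed and nothing is returned
result = df.rename({"thread.length": "thread_length"}, axis=1, inplace=True)
columns_after = list(df.columns)

print(columns_before)  # ['thread.length', 'diameter']
print(columns_after)   # ['thread_length', 'diameter']
print(result)          # None
```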

#### 3.2.8 Use a column as `index`

Another example of `inplace`
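
As a sketch, `set_index` turns a column into the index. The dataframe and the column name `part_id` are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "part_id": ["A1", "A2", "A3"],
    "diameter": [3.0, 3.5, 4.0],
})

# Without inplace: a new dataframe is returned, df keeps its old index
indexed = df.set_index("part_id")
old_index = list(df.index)

# With inplace=True: df itself now uses part_id as its index
df.set_index("part_id", inplace=True)

print(old_index)       # [0, 1, 2]
print(list(df.index))  # ['A1', 'A2', 'A3']
print(df.loc["A2", "diameter"])
```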

??? How to use the iloc method to select a row now that the index has been changed?