[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/databyjp/axi_da_transform_demos/blob/main/DAT2_Week_3_2.ipynb)
# DA Transform - Week 3 session 2

## Review

### Numpy

#### The basics

In [None]:
import numpy as np

An array is a grid of values. Alongside the raw data, it carries information about how to locate an element and how to interpret it.

[Numpy docs](https://numpy.org/doc/stable/user/absolute_beginners.html#what-is-an-array)
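
As a quick illustration of that definition, the sketch below builds a small array and inspects the attributes that describe it. The array values are made up for the example.

```python
import numpy as np

# A small 2-D array: 2 rows, 3 columns (example values only)
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)    # number of dimensions (axes)
print(arr.shape)   # number of elements along each axis
print(arr.size)    # total number of elements
print(arr.dtype)   # how each element is interpreted (its data type)
print(arr[1, 2])   # locate one element by row and column index
```

The `dtype` is the "how to interpret an element" part, while `shape` and indexing cover "how to locate an element".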

#### Numpy array example

![image.png](https://miro.medium.com/max/1200/1*sxnhgeSptW8Jfol8XUyP-Q.png)

Let's see what happens as we vary the **number of dimensions (aka *axes*)** and the **number of elements along each dimension**.


In [None]:
print('1_D array')
print(np.zeros([3]))
print('2_D array')
print(np.zeros([2, 2]))
print('3_D array')
print(np.zeros([2, 2, 2]))
print('4_D array')
print(np.zeros([2, 2, 2, 2]))
# To get the 'shape' of an array - i.e. number of dimensions/axes and the number of items along each axis
print(np.zeros([2, 2, 2, 2]).shape)

#### Indexing an ndarray

![NP indexing](https://scipy-lectures.org/_images/numpy_indexing.png)

In [None]:
# Nested list comprehensions can be very difficult to read.
# For a challenge, try re-creating this with nested *for* loops.
a = np.array([[10 * j + i for i in range(6)] for j in range(6)])
print(a)

Let's try these indexing operations shown above:

[Numpy docs on indexing](https://numpy.org/doc/stable/user/basics.indexing.html#slicing-and-striding)
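
Here is a minimal sketch of a few common slicing patterns on the same 6x6 array, with the selected values noted in comments. The variable names are only for illustration.

```python
import numpy as np

a = np.array([[10 * j + i for i in range(6)] for j in range(6)])

row_part = a[0, 3:5]     # row 0, columns 3 and 4 -> [3, 4]
corner = a[4:, 4:]       # last two rows, last two columns -> [[44, 45], [54, 55]]
column = a[:, 2]         # every row, column 2 -> [2, 12, 22, 32, 42, 52]
strided = a[2::2, ::2]   # every 2nd row from row 2, every 2nd column

print(row_part)
print(corner)
print(column)
print(strided)
```

The general pattern is `start:stop:step` for each axis, with the axes separated by a comma.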

In [None]:
a[2:4, 3:5]  # Try others

Let's try updating individual elements by their index:

In [None]:
a[1, 3] = 20   # Change this value

In [None]:
# You can even select multiple elements and replace their values in one go.
# The value 0 is an arbitrary example; try others.
a[2:4, 3:5] = 0
a

#### Numpy Lab *(Level Up: Using NumPy to Parse a File)*

### Covariance and correlation

Go through Covariance and Correlation - Lab:
- Mean normalisation
- Variance
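
As a rough sketch of how these pieces fit together in NumPy, the example below mean-normalises two made-up arrays and computes variance, covariance and correlation by hand. It assumes the sample versions (dividing by `n - 1`); the lab may use the population versions (dividing by `n`) instead.

```python
import numpy as np

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Mean normalisation: subtract the mean so the data is centred on zero
x_centred = x - x.mean()
y_centred = y - y.mean()

n = len(x)
variance_x = (x_centred ** 2).sum() / (n - 1)         # sample variance
covariance = (x_centred * y_centred).sum() / (n - 1)  # sample covariance
correlation = covariance / (x.std(ddof=1) * y.std(ddof=1))

print(variance_x, covariance, correlation)

# NumPy's built-ins should agree with the manual calculation
print(np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1])
```

Because `y` is exactly `2 * x` here, the correlation comes out as 1.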

## New material

### Data visualisation

#### Why visualise data?

![Datasaurus dozen](https://damassets.autodesk.net/content/dam/autodesk/research/publications-assets/images/AllDinosGrey_1.png)

[Source: Autodesk](https://www.autodesk.com/research/publications/same-stats-different-graphs)

#### matplotlib

One of *many* visualisation libraries available for Python.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
x = np.linspace(0, 10, 9)
y = x * 2 + 1

In [None]:
print(x)
print(y)

##### Let's try building some simple plots

In [None]:
# This is the type of syntax you see in the notes
# Create the plot
fig, ax = plt.subplots()

# Graph X vs. Y as a scatter plot
ax.scatter(x, y)

# Set title
ax.set_title("Scatter Plot in Matplotlib")

# Set labels for X and Y axes
ax.set_xlabel("X Values")
ax.set_ylabel("Y Values")

# Set text of legend (the plotted data is y = 2x + 1)
ax.legend(["Function: $y = 2x + 1$"])

##### We can also use a syntax like this:

In [None]:
plt.plot(x, y, 'bo')

# Set title
plt.title("Scatter Plot in Matplotlib")

# Set labels for X and Y axes
plt.xlabel("X Values")
plt.ylabel("Y Values")

# Set text of legend
plt.legend(["Function: $y = 2x + 1$"])

However, the matplotlib documentation recommends using the
```
fig, ax = plt.subplots()
```
syntax, as described in the [basic usage guide](https://matplotlib.org/stable/tutorials/introductory/usage.html#sphx-glr-tutorials-introductory-usage-py).

##### Check out the docs

- Basic usage: https://matplotlib.org/stable/tutorials/introductory/usage.html
- `plt.subplots`: https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html
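
For reference, here is a small sketch of the `fig, ax = plt.subplots()` style using the same `x` and `y` as earlier. Passing `label=` to the plotting call and then calling `ax.legend()` with no arguments is one way to keep the legend text next to the data it describes; the title text is just an example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 9)
y = x * 2 + 1

# Create a figure and a single axes object
fig, ax = plt.subplots()

# Draw a line with point markers and attach the legend label directly
ax.plot(x, y, marker="o", label="y = 2x + 1")

ax.set_title("Line Plot in Matplotlib")
ax.set_xlabel("X Values")
ax.set_ylabel("Y Values")

# With no arguments, legend() uses the labels set above
ax.legend()
```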


##### Note on the Lab - Exercise 1

It asks you to:
* Create a new figure object `fig` using the `plt.figure()` function.
* Use `add_axes()` to add an axes object `ax` to the canvas at absolute location [0,0,1,1].

In [None]:
fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])  # https://matplotlib.org/stable/api/figure_api.html#matplotlib.figure.Figure.add_axes

In [None]:
fig, ax = plt.subplots()

##### Let's go through Exercise 2

### Data analysis in base Python

#### Note: List comprehensions

```apple_tree_yields = [float(x["yield"]) for x in apple_orchard_data]```
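
To see what that line does, here is a sketch using a small made-up `apple_orchard_data` list (the real data comes from the lab; these records are hypothetical). It shows the loop version, the equivalent comprehension, and an optional filter.

```python
# Hypothetical records standing in for the lab's apple_orchard_data
apple_orchard_data = [
    {"tree": "A", "yield": "12.5"},
    {"tree": "B", "yield": "9.0"},
    {"tree": "C", "yield": "15.25"},
]

# Loop version: build the list one item at a time
yields_loop = []
for x in apple_orchard_data:
    yields_loop.append(float(x["yield"]))

# Equivalent list comprehension
apple_tree_yields = [float(x["yield"]) for x in apple_orchard_data]

# A comprehension can also filter with a trailing `if`
high_yields = [y for y in apple_tree_yields if y > 10]

print(apple_tree_yields)
print(high_yields)
```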

In [None]:
# Let's try some examples
tmp_list = list()
for i in range(4):
  tmp_list.append(i)
print(tmp_list)

In [None]:
tmp_list = [i for i in range(4)]
print(tmp_list)

#### File Input and Output in Python

- Locating the file
- Opening the file using appropriate software
- Reading and/or editing contents of the file
- Closing the file
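
The four steps above can be sketched as follows. The example writes to a throwaway file in a temporary directory so that it does not depend on any particular file existing; the file name `notes.txt` is made up.

```python
import os
import tempfile

# 1. Locate the file (here, a throwaway path in a temporary directory)
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "notes.txt")

# 2. Open it, 3. edit its contents, 4. close it
f = open(file_path, "w")
f.write("first line\nsecond line\n")
f.close()

# The same cycle again, this time for reading
f = open(file_path, "r")
contents = f.read()
f.close()

print(contents)
print(f.closed)  # True once the file has been closed
```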

#### File paths

In [None]:
! ls  # To see the contents of the current working directory

In [None]:
! ls ./sample_data  # To see what is in the 'sample_data' subdirectory
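
The `! ls` commands are shell commands run from the notebook. As a rough Python-side equivalent, the sketch below uses the standard library's `pathlib`; it assumes nothing about which files exist, so `sample_dir.exists()` may print `False` outside Colab.

```python
from pathlib import Path

# Current working directory (relative paths are resolved from here)
cwd = Path.cwd()
print(cwd)

# Names of everything in the current directory, similar to `ls`
entries = sorted(p.name for p in cwd.iterdir())
print(entries)

# Build a relative path piece by piece; the path need not exist yet
sample_dir = Path(".") / "sample_data"
readme = sample_dir / "README.md"
print(sample_dir.exists())
print(readme.name, readme.suffix)
```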

#### Why opening / closing matters

- Opening a file is like borrowing a book: until you return it (close the file), it stays checked out in your name and ties up system resources
- Read more: https://stackoverflow.com/questions/25070854/why-should-i-close-files-in-python

In [None]:
f = open('./sample_data/README.md')
txt = f.read()
header = 'The file contents are:\n'.upper()
print(header, txt)
txt = f.read()
print(header, txt)  # Notice what happens here - nothing gets printed, as Python is now looking at the end of the file "object". Also, the file remains "open".
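
The empty second read happens because a file object keeps track of a current position. A minimal sketch of that behaviour, using a throwaway file so it runs anywhere (the file name is made up):

```python
import os
import tempfile

# Create a small throwaway file to read from
file_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(file_path, "w") as f:
    f.write("hello\n")

f = open(file_path)
first_read = f.read()    # reads everything; the position moves to the end
second_read = f.read()   # nothing left to read, so this is an empty string
f.seek(0)                # move the position back to the start of the file
third_read = f.read()    # the full contents again
f.close()                # release the file when done

print(repr(first_read), repr(second_read), repr(third_read))
```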

#### Idiomatic code for opening & operating on file

- `with open(file_path) as file_obj:`
- `f.readlines()` vs. `f.read()`

Check output types
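
As a small self-contained sketch of the difference, the example below writes a tiny made-up CSV to a temporary file and reads it back both ways.

```python
import os
import tempfile

# Create a tiny throwaway file (made-up contents)
file_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(file_path, "w") as f:
    f.write("a,b\n1,2\n3,4\n")

with open(file_path) as f:
    as_string = f.read()       # one string holding the whole file

with open(file_path) as f:
    as_lines = f.readlines()   # a list with one string per line

print(type(as_string), type(as_lines))
print(f.closed)  # True: the with block closed the file for us

# Each line keeps its trailing newline, so strip it if it gets in the way
stripped = [line.rstrip("\n") for line in as_lines]
print(stripped)
```

The trailing newline on each line is also why printing lines one by one shows a blank line between them.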

In [None]:
with open('./sample_data/california_housing_test.csv', 'r') as f:
  txt = f.readlines()

header = 'The file contents are:\n'.upper()
# print(header, txt)
# print(type(txt))
# print(txt[:5])
for line in txt[:5]:  # 'line' is easier to read than 'l', which looks like the digit 1
  print(line)

In [None]:
# Code

#### File Types

- .csv
- .tsv
- .json
- .xml
- binary files

Let's take a look at examples of file types
(you can find more examples on Kaggle)
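
To make the text-based formats concrete, here is a sketch that parses the same two made-up records from CSV, TSV, JSON and XML strings using only the standard library. The data and field names are invented for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# The same made-up records in four text formats
csv_text = "name,score\nAda,90\nBob,85\n"
tsv_text = "name\tscore\nAda\t90\nBob\t85\n"
json_text = '[{"name": "Ada", "score": 90}, {"name": "Bob", "score": 85}]'
xml_text = "<people><person name='Ada' score='90'/><person name='Bob' score='85'/></people>"

# io.StringIO lets a string stand in for an open file
csv_rows = list(csv.reader(io.StringIO(csv_text)))
tsv_rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
json_rows = json.loads(json_text)
xml_rows = [person.attrib for person in ET.fromstring(xml_text)]

print(csv_rows)
print(tsv_rows)
print(json_rows)
print(xml_rows)
```

Note that JSON keeps the numbers as numbers, while the CSV, TSV and XML values all arrive as strings.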

#### Accessing file types with built-in libraries

In [None]:
import json
with open('./sample_data/anscombe.json') as f:
  data = json.load(f)
with open('./sample_data/anscombe.json') as f: 
  raw_data = f.read()

# See how the json.load function loads our data nicely
print(raw_data[:200])
print(data[:5])


In [None]:
import csv
with open('sample_data/california_housing_train.csv', 'r') as f:
  # data = list(csv.reader(f))  
  data = list(csv.DictReader(f))
data[:5]  # Compare the two outputs between reader and DictReader
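
To compare the two readers without depending on the Colab sample files, here is a sketch that runs both over a small in-memory CSV. The rows are made up, and the column names are only illustrative.

```python
import csv
import io

# Made-up rows with illustrative column names
csv_text = (
    "longitude,latitude,median_house_value\n"
    "-122.05,37.37,344700.0\n"
    "-118.30,34.26,176500.0\n"
)

# csv.reader: every row is a list, and the header is just the first row
rows_as_lists = list(csv.reader(io.StringIO(csv_text)))

# csv.DictReader: the header supplies the keys, each data row is a dict
rows_as_dicts = list(csv.DictReader(io.StringIO(csv_text)))

print(rows_as_lists[0])
print(rows_as_dicts[0])

# All values are strings, so convert before doing arithmetic
values = [float(row["median_house_value"]) for row in rows_as_dicts]
print(values)
```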