# DAML4 notes
## Week 1 - Data modalities

<hr style="border:2px solid black"> </hr>

## What are these notebooks for?

Each week of the course has a lecture and a lab. In the lecture you are introduced to material, and in the lab you use Python to solve problems using that material.

These notebooks act as a bridge between the lecture and the lab. Each one summarises the lecture and provides code examples as a starting point for writing your own code. I will import packages on the fly, as they are needed, so you can see exactly when each one is required. In the labs, imports will be done at the start.

The first lecture was slightly unusual, in that it was also an introduction to the course. I am not going to summarise the introduction (whatever that would entail) but I am going to look at different data modalities, and how you can process them in Python.

## Data modalities

### Time series 

In time series data, we have some quantity we care about at different points in time. We are going to consider the value of the pound (GBP) vs. the dollar (USD). A lot of the time we import our data from spreadsheets (e.g. Excel files, CSV files) and this is no exception!

I have already downloaded a CSV file containing GBP vs. USD values from 04/10/2021 to 03/10/2022 from [Yahoo Finance](https://finance.yahoo.com/quote/GBPUSD%3DX%3B/history?period1=1633265701&period2=1664801701&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true).

How do we get this into Python? There is a fantastic library called pandas which does all the hard work for us.





In [None]:
# Import pandas for dataframes
import pandas as pd

# Read CSV into dataframe using pandas
df = pd.read_csv("data/GBPUSD=X.csv")

# Show dataframe
df
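
Before doing anything with a new dataframe it is worth checking what pandas actually read. The sketch below is not from the lecture. It uses a tiny made-up CSV held in a string so that it runs without the data file. The numbers are invented, `demo_df` is a hypothetical stand-in for `df`, and I am assuming the column names match a Yahoo Finance export.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/GBPUSD=X.csv (all values are made up)
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2021-10-04,1.3550,1.3640,1.3530,1.3552,1.3552,0
2021-10-05,1.3610,1.3650,1.3580,1.3607,1.3607,0
2021-10-06,1.3630,1.3640,1.3540,1.3628,1.3628,0
"""

# read_csv accepts a file-like object as well as a path
demo_df = pd.read_csv(io.StringIO(csv_text))

# First few rows, the type of each column, and the (rows, columns) shape
print(demo_df.head())
print(demo_df.dtypes)
print(demo_df.shape)
```

Note that `Date` is read as text rather than as dates, which matters when plotting and filtering.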

Let's say we care about the closing value at the end of each day. This is in the `Close` column. We can plot this against `Date` using a few lines of code.

In [None]:
# matplotlib for plotting
import matplotlib.pyplot as plt

# This makes matplotlib output nice figures without much tweaking
plt.rcParams.update(
    {
        "lines.markersize": 10,  # Big points
        "font.size": 15,  # Larger font
        "xtick.major.size": 5.0,  # Bigger xticks
        "ytick.major.size": 5.0,  # Bigger yticks
    }
)

# Plotting code
fig, ax = plt.subplots(figsize=[8,6])
ax.plot(df["Date"], df["Close"])
ax.set_xlabel('Date')
ax.set_ylabel('GBP to USD rate')
ax.set_title('GBP vs USD from October 2021-2022')


# Formatting specifically for date strings
fig.autofmt_xdate()


# A hacky way of showing a subset of x ticks
dates = df["Date"].to_numpy()
ax.set_xticks([dates[i] for i in range(0, len(dates), 50)])
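
A less hacky option is to convert the date strings into real datetimes, after which matplotlib picks sensible tick positions by itself. This is only a sketch on made-up values: `demo_df` is a hypothetical stand-in for `df`, and I am assuming the dates are written as `YYYY-MM-DD`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: a year of business days with made-up rates
demo_dates = pd.date_range("2021-10-04", "2022-10-03", freq="B")
demo_df = pd.DataFrame(
    {
        "Date": demo_dates.strftime("%Y-%m-%d"),  # strings, as read from a CSV
        "Close": np.linspace(1.36, 1.12, len(demo_dates)),
    }
)

# Convert the date strings into datetimes
demo_df["Date"] = pd.to_datetime(demo_df["Date"])

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(demo_df["Date"], demo_df["Close"])
ax.set_xlabel("Date")
ax.set_ylabel("GBP to USD rate")

# Rotate the date labels; the tick positions are chosen automatically
fig.autofmt_xdate()
```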


We can also take a column and convert it into a numpy array. Let's take the `Close` column, convert it into an array, and find its mean. (This can be done directly in pandas; going via numpy here is just for illustration!)

In [None]:
# import numpy for arrays
import numpy as np

close_array = df["Close"].to_numpy()

close_mean = np.mean(close_array)
print(f"The mean close value is {close_mean:.3f}")
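
For comparison, here is a sketch of the same calculation done directly in pandas. It uses a small made-up `Close` column (`demo_df` and `with_gap` are hypothetical names) so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with made-up closing values
demo_df = pd.DataFrame({"Close": [1.36, 1.30, 1.24, 1.10]})

# The numpy route and the pandas route give the same answer
numpy_mean = np.mean(demo_df["Close"].to_numpy())
pandas_mean = demo_df["Close"].mean()
print(f"numpy: {numpy_mean:.3f}, pandas: {pandas_mean:.3f}")

# They differ when values are missing
with_gap = pd.Series([1.0, np.nan, 3.0])
print(with_gap.mean())  # pandas skips missing values by default
print(np.mean(with_gap.to_numpy()))  # numpy returns nan
```

The difference in how missing values are handled is worth keeping in mind when a spreadsheet has gaps.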

### Tabular data

We actually just extracted our time series from tabular data! Tabular data will make up the vast majority of data you look at in this course. It is very common in the real world.

Let's say we want the list of dates for which the opening value of GBP vs. USD was greater than 1.3.

First, we can see which rows this corresponds to:

In [None]:
valid_rows = df["Open"] > 1.3
print(valid_rows)

We can then use this boolean array to pick out the matching dates.

In [None]:
df.loc[valid_rows, "Date"]

If we want the rows that correspond to dates in the year 2022, we can build a boolean array by checking which date strings contain "2022", and filter with it.

In [None]:
valid_rows = df["Date"].str.contains("2022")
df[valid_rows]
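
String matching works here because the year appears in the date text, but it would match "2022" anywhere in the string. A sketch of an alternative, assuming the dates are written as `YYYY-MM-DD`: convert them to datetimes and compare the year. It also shows two boolean arrays combined with `&`. The data is made up and `demo_df` is a hypothetical stand-in for `df`.

```python
import pandas as pd

# Hypothetical stand-in for df with made-up opening values
demo_df = pd.DataFrame(
    {
        "Date": ["2021-12-30", "2021-12-31", "2022-01-03", "2022-01-04"],
        "Open": [1.349, 1.352, 1.353, 1.347],
    }
)

# Compare the year of each date rather than matching text
dates = pd.to_datetime(demo_df["Date"])
in_2022 = dates.dt.year == 2022

# A second condition on a different column
above = demo_df["Open"] > 1.35

# & keeps only the rows where both conditions hold
both = in_2022 & above
print(demo_df[both])
```

If the conditions are written inline, each one needs its own brackets, e.g. `(a > 1) & (b < 2)`.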

### Images

PIL (the Python Imaging Library, nowadays provided by the Pillow package) lets us read in images, as well as perform high-level manipulations. We will load in a JPEG of a dog and print its size in pixels.

In [None]:
# Image from PIL lets us manipulate images
from PIL import Image

# Read image
image = Image.open("data/dog.jpg")

# Use Jupyter's inbuilt display function
display(image)
print(f"The image has a size of {image.size}")

We can downsize the image...

In [None]:
image = image.resize((224, 224))
display(image)

and rotate it!

In [None]:
image = image.rotate(90)
display(image)
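
Two details are easy to miss here. `resize` takes `(width, height)` and does not preserve the aspect ratio, and `rotate` turns the image counter-clockwise, keeping the original canvas size unless `expand=True` is passed. The sketch below shows both on a plain made-up image (`demo_image` is hypothetical, so no file is needed).

```python
from PIL import Image

# Hypothetical stand-in for the dog photo: a plain image, 400 wide by 300 high
demo_image = Image.new("RGB", (400, 300), color=(200, 120, 40))

# resize squashes the image into exactly the size requested
squashed = demo_image.resize((224, 224))

# thumbnail shrinks in place to fit inside the box, keeping the aspect ratio
thumb = demo_image.copy()
thumb.thumbnail((224, 224))

# expand=True grows the canvas so nothing is cropped by the rotation
rotated = demo_image.rotate(90, expand=True)

print(squashed.size, thumb.size, rotated.size)
```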

Recall from the lectures that an image is actually stored as a 3D array (height by width by colour channel). We can see this if we convert our image into a numpy array.

In [None]:
# Convert image into a numpy array
data_im = np.array(image)

# See what the array looks like and print its shape
print(data_im)
print(f"The image as a numpy array has shape {data_im.shape}")
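
One thing to watch is the ordering: PIL reports `size` as (width, height), whereas the numpy shape is (height, width, channels). A small sketch on a made-up, non-square image (`demo_image` and `demo_array` are hypothetical) makes this visible.

```python
import numpy as np
from PIL import Image

# Hypothetical image, 400 wide by 300 high, filled with a single colour
demo_image = Image.new("RGB", (400, 300), color=(200, 120, 40))
demo_array = np.array(demo_image)

print(demo_image.size)  # (width, height)
print(demo_array.shape)  # (height, width, channels)
print(demo_array.dtype)  # 8-bit unsigned integers, so values run from 0 to 255
print(demo_array[0, 0])  # red, green, blue values of the top-left pixel
```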

Let's manually set the red colour channel to zero for all pixels and see what happens.

In [None]:
# Create a copy of our image array
data_nored = data_im.copy()

# Set all values in the red channel to 0
data_nored[:, :, 0] = 0

# Now that we have a numpy array, we can use matplotlib to display it
plt.imshow(data_nored)

Without any red, the blue and green have got more ... blue and green. This wasn't the most obvious manipulation! Try some others by changing the code above.
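
As a starting point, here is a sketch of two other manipulations on a tiny made-up image: swapping the red and blue channels, and a crude greyscale made by averaging the channels (a simple choice on my part; other weightings exist). `Image.fromarray` converts an array back into a PIL image. All the names here are hypothetical.

```python
import numpy as np
from PIL import Image

# Hypothetical tiny image (4 wide by 2 high) so the numbers are easy to check
demo_array = np.array(Image.new("RGB", (4, 2), color=(200, 120, 40)))

# Swap red and blue by reversing the order of the channel axis
swapped = demo_array[:, :, ::-1]

# Crude greyscale: average the three channels for every pixel
grey = demo_array.mean(axis=2).astype(np.uint8)

# Convert back into a PIL image (made contiguous first, as the slice is a view)
swapped_image = Image.fromarray(np.ascontiguousarray(swapped))

print(swapped[0, 0], grey[0, 0], swapped_image.size)
```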

<hr style="border:2px solid black"> </hr>

#### Written by Elliot J. Crowley and &copy; The University of Edinburgh 2022-23