# Session 4: Computing - Data Handling and Visualisation

This session will introduce key Python tools for data analysis and visualisation, focusing on fundamental concepts to support your learning in the unit. You will explore NumPy for handling numerical data, learn how to create basic plots, work with Pandas to manipulate datasets, and move on to advanced plotting techniques. The session is designed to build a strong foundation in Python, equipping you with essential skills for data analysis in future projects. Additionally, you will complete exercises tailored to specific engineering disciplines to apply the concepts you learn.

## NumPy

NumPy (Numerical Python) is a key library for performing numerical calculations in Python, designed to make working with large sets of numbers much easier and faster. It allows you to create and manipulate collections of numbers that are arranged in a structured format. For example, a simple list of numbers can be stored and managed more efficiently using NumPy. When these numbers are organised in rows and columns, they form what is called a matrix, which is useful for handling more complex data like images, signals, or large datasets. NumPy also includes a wide range of mathematical functions that allow for quick and efficient calculations on these collections of numbers, making it essential for scientific computing, data analysis, and engineering tasks. It integrates well with other Python libraries, such as **Pandas** for data handling, **SciPy** for scientific computations, and **Matplotlib** for creating graphs and visualisations.

### NumPy Module
NumPy is a “module” that contains:
- Data types
- Functions
- Constants


### How to Import NumPy
To use NumPy, an `import` statement must be run before calling any NumPy functions or data types. Typically, it is placed at the beginning of a Python program. An alias (typically a shorter name) can be given using the `as` keyword to make the code quicker to write. You only need to do this once in a notebook, but remember to run the cell before any other to avoid errors saying that `np` has not been defined.

In [1]:
import numpy as np

### NumPy functions

NumPy provides a number of mathematical functions and constants that you can use:

| NumPy Function 	| Mathematical Equivalent 	| Comments 	|
|----------------	|-------------------------	|----------	|
| `np.pi`          	|    $\pi$                     	|          	|
| `np.e`          	|    $e$                    	|          	|
| `np.cos(x)`               	|      $\cos {x}$                   	| $x$ should be in radians          	|
| `np.sin(x)`               	|      $\sin {x}$                   	| $x$ should be in radians         	|
| `np.sqrt(x)`               	|      $\sqrt x$                   	|          	|
| `np.arctan(x)`               	|      $\arctan {x}$                   	|          	|
| `np.exp(x)`              	|          $e^x$               	|          	|
| `np.abs(x)`              	|          $\left\|x\right\|$               	|          	|

For example, let's calculate the $\sin$, $\cos$ and $\tan$ of $\pi/4$ (`np.tan` is used in the same way as `np.sin` and `np.cos`):

In [None]:
x=np.pi/4

print(f"for the angle Pi/4, sin={np.sin(x):0.3f}, cos={np.cos(x):0.3f} and tan={np.tan(x):0.3f}")
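
Since the trigonometric functions expect radians, angles given in degrees need converting first. The following is a minimal sketch using `np.deg2rad`; the variable names `angle_deg`, `angle_rad` and `sin_45` are chosen here for illustration only.

```python
import numpy as np

angle_deg = 45.0                   # angle in degrees
angle_rad = np.deg2rad(angle_deg)  # convert to radians before calling np.sin
sin_45 = np.sin(angle_rad)

print(f"{angle_deg} degrees = {angle_rad:0.4f} radians, sin = {sin_45:0.3f}")
```

Passing `45.0` straight to `np.sin` would be interpreted as 45 radians, which gives a very different answer.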

### NumPy Arrays

These are ordered lists (or grids, more on that in a bit) of numerical elements, all of the same type (e.g. float or int).

In [None]:
x=np.array([1.0,2.0,3.0,4.0])
print(x)
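
To see what "all of the same type" means in practice, the short sketch below inspects the `dtype` attribute of two arrays. The array names are illustrative; the exact integer type reported can differ between platforms.

```python
import numpy as np

whole_numbers = np.array([1, 2, 3])  # only integers, so NumPy picks an integer type
mixed = np.array([1, 2.5, 3])        # one float is enough to make every element a float

print(whole_numbers.dtype)
print(mixed.dtype)
print(mixed)
```

In the second array the integers `1` and `3` are stored as floats so that every element shares one type.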

You can also use arrays to define grids (two-dimensional or more) of numbers. See this example:

In [None]:
y=np.array([[1.0,2.0,3.0],[4.0,5.0,6.0]])
print(y)
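
As a brief sketch of how such a grid can be inspected, the example below reads the shape of the same array and picks out single elements, rows and columns. Indices start at zero, and the order is row first, then column.

```python
import numpy as np

y = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

print(y.shape)   # (2, 3): two rows and three columns
print(y[1, 2])   # 6.0: second row, third column
print(y[0])      # the whole first row
print(y[:, 1])   # the whole second column
```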

There are some statements you can use to define NumPy arrays in specific shapes with ones or zeros. Have a look at the following:

In [None]:
np.zeros(shape=(10))

In [None]:
np.ones(shape=(5))

In [None]:
np.zeros(shape=(5,3))

In [None]:
np.linspace(0,10,21)

The last one, `linspace`, is a very useful way to define arrays for engineering applications. It stands for linear space. It produces an array of evenly spaced numbers that starts at the first parameter and ends at the second parameter (both included), with the third parameter giving the total number of values.
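
The spacing between neighbouring values follows from these three parameters as (stop - start) / (number of values - 1). The sketch below checks this for the call used above; the variable names are illustrative.

```python
import numpy as np

x = np.linspace(0, 10, 21)
step = x[1] - x[0]  # spacing between neighbouring values

print(len(x))       # number of values in the array
print(x[0], x[-1])  # first and last values
print(step)         # (10 - 0) / (21 - 1)
```

Note that 21 values are needed, not 20, to get a spacing of 0.5 between 0 and 10, because both end points are included.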

We can then do all sorts of mathematics on Arrays. Have a look at the following:

In [None]:
x1=np.array([np.sqrt(2)/2, np.sqrt(2)/2,1.0])
x2=np.array([np.sqrt(2)/2, -np.sqrt(2)/2,0.0])
y1 = 2.0*x1
z=x1+x2
w=x1*x2
y2=2.0+x1

print("x1=",x1)
print("x2=",x2)
print("y1=",y1)
print("z=",z)
print("w=",w)
print("y2=",y2)
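
Note that `x1*x2` above multiplies the arrays element by element; it is not the vector dot product. As a hedged illustration of the difference, the sketch below compares the two using `np.dot`, which is not used elsewhere in this session.

```python
import numpy as np

x1 = np.array([np.sqrt(2) / 2, np.sqrt(2) / 2, 1.0])
x2 = np.array([np.sqrt(2) / 2, -np.sqrt(2) / 2, 0.0])

elementwise = x1 * x2          # an array with one product per element
dot_product = np.dot(x1, x2)   # a single number: the sum of those products

print("element-wise:", elementwise)
print("dot product:", dot_product)
```

The dot product here is zero (to within rounding), which tells us that the two vectors are perpendicular.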

## Plotting with Matplotlib

**Matplotlib** is a widely used Python library for creating static, animated, and interactive visualisations. It offers a flexible framework for generating various types of plots, such as line charts, bar graphs, histograms, and scatter plots. Matplotlib is often utilised in scientific computing, data analysis, and engineering due to its ability to produce high-quality, publication-ready graphics.

Within Matplotlib, **pyplot** is a module that provides a simpler interface for creating plots with minimal code. It simplifies the process by managing figure creation, plot construction, and rendering, making it easier to generate quick visualisations without needing to handle the underlying complexities of Matplotlib directly.

Let's start with importing `pyplot`

In [10]:
import matplotlib.pyplot as plt

Here is an example of using the library to visualise $y=x^2$:

In [None]:

# Data for plotting
x = np.linspace(0,10,100)
y = x**2

# Create the plot
plt.plot(x, y)

# Display the plot
plt.show()


There are options we can use to modify the plot style. See this example:

In [None]:
# Let us change the line style and colour
plt.plot(x, y, linewidth = 3, linestyle = 'dashdot', color = 'black')

# Add a title and labels
plt.title('Example Line Plot')
plt.xlabel('Displacement (mm)')
plt.ylabel('Force (N)', fontsize=14)
plt.tick_params(axis="x", labelsize=12)
plt.ylim(0,150)

# Display the plot
plt.show()

### Exercise

Plot the following graphs for $x$ between $0$ and $2\pi$ and format them to include axis labels:

$y=\sin(x)$

$y=\tan(x)$

$y=e^x$

$y=\log(x)$


In [13]:
# write your script here:


### Scatter plots

A scatter plot is a type of plot that displays individual data points as markers (such as dots) on a two-dimensional plane, with one variable plotted along the x-axis and the other along the y-axis. Scatter plots are useful for visualising the relationship between two variables, helping to identify patterns, correlations, or trends within the data.

Scatter plots are particularly important in data analysis because they allow us to:

* Identify relationships: Scatter plots can reveal whether two variables have a linear, non-linear, or no relationship at all.
* Detect outliers: Points that deviate significantly from the general pattern can be easily identified.
* Assess variability: They show how much variation exists in the data, both in terms of the spread of the points and clustering patterns.

Here is a simple example of creating a scatter plot using `pyplot`:

In [None]:
# Data for plotting
x = np.array([4, 10, 7, 5, 2, 0])
y = np.array([22.8, 55.5, 38.6, 28.5, 11.3, 7.2])

# Create a scatter plot
plt.scatter(x, y, color='red',label='my data')

# Add a legend and labels
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()

# Display the plot
plt.show()
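
If a scatter plot suggests a linear relationship, one possible way to put a number on it is the correlation coefficient. The sketch below uses `np.corrcoef` on the same data; this function is not covered elsewhere in this session and is shown only as an illustration.

```python
import numpy as np

x = np.array([4, 10, 7, 5, 2, 0])
y = np.array([22.8, 55.5, 38.6, 28.5, 11.3, 7.2])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between x and y
r = np.corrcoef(x, y)[0, 1]

print(f"correlation coefficient r = {r:.3f}")
```

A value of `r` close to 1 indicates a strong positive linear relationship, which matches what the plot shows.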


### Saving figures for use elsewhere

You may wish to save your plots to use in reports, etc. You can do this using the `savefig` function:

In [None]:
# Data for plotting
x = np.array([4, 10, 7, 5, 2, 0])
y = np.array([22.8, 55.5, 38.6, 28.5, 11.3, 7.2])

# Create a scatter plot
plt.scatter(x, y, color='red',label='my data')

# Add a legend and labels
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()

plt.savefig('example_scatter_plot.jpg', dpi=300, bbox_inches='tight') #dpi stands for dots per inch and indicates how fine a resolution python will use for saving the figure

**Note** that you can replace `jpg` with `png` or `pdf` to save the image in different formats. The `bbox_inches='tight'` option tells Matplotlib not to leave too much whitespace around the figure, which is usually a good thing if you want to use it in a report or presentation. **Make sure you save the figures at a high resolution for your reports, as precision in visualisation of information is critical in engineering communications**.
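
As a sketch of saving the same figure in more than one format, the example below loops over two file extensions. It writes into a temporary folder (an assumption made here so that the example does not clutter your working directory); in practice you would normally save next to your notebook.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

x = np.array([4, 10, 7, 5, 2, 0])
y = np.array([22.8, 55.5, 38.6, 28.5, 11.3, 7.2])

plt.scatter(x, y, color='red', label='my data')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()

out_dir = tempfile.mkdtemp()  # temporary folder, used for illustration only
saved_files = []
for extension in ("png", "pdf"):
    path = os.path.join(out_dir, f"example_scatter_plot.{extension}")
    plt.savefig(path, dpi=300, bbox_inches='tight')  # format follows the extension
    saved_files.append(path)

plt.close()  # release the figure once it has been saved
print(saved_files)
```

Matplotlib chooses the file format from the extension in the file name, so no other change is needed.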

### Multiple plots on the same axis

We can create multiple plots on the same axis. Look at the following example which puts two line plots and a scatter plot on the same axis:

In [None]:
import matplotlib.pyplot as plt

# Data for plotting
x = np.linspace(1,5,15)
y1 = x**2  # y = x^2
y2 = np.random.randint(1,125,size=15)    # An array of 15 random integers from 1 to 124 (the upper bound, 125, is excluded), more on this later
y3 = x**3 # y = x^3

# Create the figure and plot
plt.plot(x, y1, label='y = x^2', color='blue')  # First line plot
plt.plot(x, y3, label='y = x^3', color='green') # Second line plot
plt.scatter(x, y2, label='Scatter data', color='red') # Scatter plot

# Add title and labels
plt.title('Line and Scatter Plots on the Same Axis')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')

# Add a legend
plt.legend()

# Save the plot to a file, then display it
plt.savefig("combined_plot.jpg",dpi=300,bbox_inches='tight')
plt.show()
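
Because `y2` is random, the scatter points will change every time the cell is run. If you need the same figure each time (for a report, say), one option is a seeded random number generator. The sketch below uses `np.random.default_rng`; the seed value 42 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # the seed value is arbitrary
y2 = rng.integers(1, 125, size=15)     # 15 integers from 1 to 124 (125 is excluded)

# A second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
y2_again = rng_again.integers(1, 125, size=15)

print(y2)
print(np.array_equal(y2, y2_again))
```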


## Pandas

`Pandas` is a powerful data manipulation and analysis library in Python, designed to make working with structured data both easy and efficient. It is particularly well-suited for handling large amounts of data due to its versatile data structures, such as DataFrames (two-dimensional, table-like data) and Series (one-dimensional data).

One of the key strengths of Pandas is its ability to read and write data from various formats, with just a few lines of code. This makes managing large datasets more straightforward, as Pandas automates much of the tedious work of loading, cleaning, and exporting data.

In scientific research and data analysis, datasets are often stored in **CSV (Comma-Separated Values)** format due to its simplicity and wide compatibility across different platforms and software. CSV files are plain text files that contain tabular data, making them ideal for storing large amounts of scientific measurements, experimental results, and sensor data.

For instance, in a study of wind power generation, one might track the generated power over time. This data could be stored in a CSV file with columns representing time, wind speed, power generated and possibly other variables such as temperature. So, for example, data that looks like this:

| Time (minutes) | Wind Speed (m/s) | Rotor Speed (rpm) | Power Generated (kW) | Temperature (°C) |
|----------------|------------------|-------------------|----------------------|------------------|
| 0              | 10.9             | 14                | 2821.1               | 20.0             |
| 10             | 11.7             | 15                | 3507.2               | 20.3             |
| 20             | 9.5              | 12                | 1871.0               | 20.6             |
| 30             | 10.2             | 13                | 2276.3               | 20.9             |
| 40             | 10.4             | 13                | 2410.8               | 21.3             |
| ...            | ...              | ...               | ...                  | ...              |

will appear like this in the CSV file:

```csv
Time (minutes),Wind Speed (m/s),Rotor Speed (rpm),Power Generated (kW),Temperature (deg_C)
0,10.9,14,2821.1,20.0
10,11.7,15,3507.2,20.3
20,9.5,12,1871.0,20.6
30,10.2,13,2276.3,20.9
40,10.4,13,2410.8,21.3
...
```


Now let's import `Pandas` and get started with using this data, which has been provided to you in the `Wind_Turbine_Data_24h.csv` file.

In [17]:
import pandas as pd # this may produce a warning the first time you run it, you can safely ignore the warning for now.

Pandas uses `DataFrames` (a variable type that we have not yet seen) to store the data. We can read all of the data in the CSV file and create a dataframe with a single command:

In [18]:
wind_data=pd.read_csv("Wind_Turbine_Data_24h.csv")
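
To see what `read_csv` does without needing the file, the sketch below reads the five sample rows shown earlier from a text string instead. Using `io.StringIO` to stand in for a file is an assumption made for illustration; the name `sample` is also illustrative.

```python
import io

import pandas as pd

# The five sample rows from the CSV listing, held in a string
csv_text = """Time (minutes),Wind Speed (m/s),Rotor Speed (rpm),Power Generated (kW),Temperature (deg_C)
0,10.9,14,2821.1,20.0
10,11.7,15,3507.2,20.3
20,9.5,12,1871.0,20.6
30,10.2,13,2276.3,20.9
40,10.4,13,2410.8,21.3
"""

# StringIO makes the string behave like an open file
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # the type pandas inferred for each column
```

The first line of the CSV becomes the column names, and pandas works out for itself which columns hold integers and which hold floats.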

Now we can test what we will get if we try to print this variable:

In [None]:
print(wind_data)

Dataframes have lots of useful functions that allow the data to be investigated. For example:

In [None]:
wind_data.head(5)

In [None]:
wind_data.tail(10)

In [None]:
wind_data.loc[27] #note the square bracket instead of the normal bracket

In [None]:
wind_data.shape

In [None]:
print(wind_data["Wind Speed (m/s)"].mean())

In [None]:
wind_data["Temperature (deg_C)"].max()

In [None]:
wind_data.median()

In [None]:
wind_data["Power Generated (kW)"]>2000

In [None]:
wind_data.mask(wind_data["Power Generated (kW)"]<2000)

In [None]:
wind_data.describe()
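
A comparison such as `wind_data["Power Generated (kW)"]>2000` produces a column of `True`/`False` values. One common use of such a column is to keep only the rows where it is `True`. The sketch below shows this on the five sample rows from the table above; the names `sample`, `is_high_power` and `high_power_rows` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Wind Speed (m/s)": [10.9, 11.7, 9.5, 10.2, 10.4],
    "Power Generated (kW)": [2821.1, 3507.2, 1871.0, 2276.3, 2410.8],
})

is_high_power = sample["Power Generated (kW)"] > 2000  # True/False for each row
high_power_rows = sample[is_high_power]                # keep only the True rows

print(is_high_power.sum())  # number of rows above 2000 kW
print(high_power_rows)
```

Unlike `mask`, which keeps every row and replaces the matching ones with missing values, this kind of selection returns a shorter dataframe.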

### Getting `NumPy` arrays from `Dataframes`

We can take a specific column from the dataframe and put it in a NumPy array. For example, we can put wind speed and rotor speed in two arrays like this:

In [32]:
wind_speed=wind_data["Wind Speed (m/s)"].to_numpy()
rotor_speed=wind_data["Rotor Speed (rpm)"].to_numpy()

Now we can use these NumPy arrays to do things like plotting:

In [None]:
import matplotlib.pyplot as plt

plt.scatter(wind_speed, rotor_speed, label='Speed', color='blue') # Scatter plot

# Add title and labels
plt.title('Wind and Rotor Speed')
plt.xlabel('wind speed (m/s)')
plt.ylabel('rotor speed (rpm)')

There seems to be a linear relationship between the two quantities, so let's fit a line and plot that too. Remember that if we want to fit a line to data knowing that $(0,0)$ is on the line, we can use the model $y=mx$ where $m=\frac{\sum{x_i y_i}}{\sum{x_i^2}}$ and $(x_i,y_i)$ is data point $i$. So let's calculate $m$:

In [None]:
xy=wind_speed*rotor_speed # element-wise product: each entry is x_i*y_i for one data point
x2=wind_speed*wind_speed  # element-wise square: each entry is x_i^2 for one data point
sxy=xy.sum() # add up all the elements of the array
sx2=x2.sum()
m=sxy/sx2 # slope of the best-fit line through the origin

print(f"{m:.4f}")
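
The slope formula comes from minimising the sum of squared errors $\sum (y_i - m x_i)^2$: setting its derivative with respect to $m$ to zero gives $\sum x_i (y_i - m x_i) = 0$, and so $m = \sum x_i y_i / \sum x_i^2$. As a cross-check, the sketch below compares the formula with NumPy's general least-squares solver, `np.linalg.lstsq`, on the five sample rows from the table above. The solver is not otherwise used in this session and the variable names are illustrative.

```python
import numpy as np

# First five rows of wind speed (m/s) and rotor speed (rpm) from the sample table
wind_speed_sample = np.array([10.9, 11.7, 9.5, 10.2, 10.4])
rotor_speed_sample = np.array([14.0, 15.0, 12.0, 13.0, 13.0])

# Slope from the formula m = sum(x*y) / sum(x^2)
m_formula = (wind_speed_sample * rotor_speed_sample).sum() / (wind_speed_sample**2).sum()

# lstsq expects a 2-D matrix of inputs, so the x values become a single column
design_matrix = wind_speed_sample.reshape(-1, 1)
solution, residuals, rank, singular_values = np.linalg.lstsq(
    design_matrix, rotor_speed_sample, rcond=None
)
m_lstsq = solution[0]

print(f"formula: {m_formula:.4f}, lstsq: {m_lstsq:.4f}")
```

The two values should agree. With only five rows, the slope will differ from the one obtained from the full 24-hour dataset.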

Now we need to create a series of points on the $x$ axis and calculate the corresponding $y$ coordinates using our equation $y=mx$ to plot the points:

In [None]:
x=np.linspace(0,18,10)
y=m*x

plt.plot(x,y, label='approx Speed', color='red')

plt.title('Wind and linear approximation of Rotor Speed')
plt.xlabel('wind speed (m/s)')
plt.ylabel('linear approximation of rotor speed (rpm)')

Now let's put both on the same plot:

In [None]:
plt.scatter(wind_speed, rotor_speed, label='Speed', color='blue') # Scatter plot of data
plt.plot(x,y, label='approx Speed', color='red') # plotting the fitted line

# Add title and labels
plt.title('Wind and Rotor Speed')
plt.xlabel('wind speed (m/s)')
plt.ylabel('rotor speed (rpm)')

### Manipulating data in Dataframes

We can change the data in dataframes. For example, let's say we are interested in changing our wind turbine data so that we are storing the wind speed in miles per hour rather than metres per second. The following command adds a new column to the dataframe which will contain speeds in miles per hour. The formula to convert speed from m/s to mph is: $1\,\text{m/s} = 2.23694\,\text{mph}$

In [66]:
wind_data['Wind Speed (mph)'] = wind_data['Wind Speed (m/s)'] * 2.23694
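
The conversion factor can be worked out from the definitions of the units: one mile is 1609.344 metres and one hour is 3600 seconds. The sketch below shows the arithmetic; the constant and variable names are illustrative.

```python
METRES_PER_MILE = 1609.344
SECONDS_PER_HOUR = 3600

# metres per second -> metres per hour -> miles per hour
ms_to_mph = SECONDS_PER_HOUR / METRES_PER_MILE
print(f"{ms_to_mph:.5f}")  # prints 2.23694

speed_ms = 10.9  # first wind speed in the sample table, in m/s
speed_mph = speed_ms * ms_to_mph
print(f"{speed_ms} m/s = {speed_mph:.1f} mph")
```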

Let's now check the first 5 lines:

In [None]:
wind_data.head(5)

That looks OK, so now we can drop the old column and store the result in a new dataframe:

In [68]:
new_wind_data=wind_data.drop(columns=['Wind Speed (m/s)'])

Let's check:

In [None]:
new_wind_data.head(5)

We can also edit the data in a dataframe in place, without creating a new dataframe. For example, let's say we want to round the wind speed to 1 decimal place:

In [None]:
new_wind_data['Wind Speed (mph)'] = new_wind_data['Wind Speed (mph)'].round(1)
new_wind_data.head(5)

### Writing CSV files

As you may expect, we can also write CSV files easily from dataframes. Let's save our mph wind speed data to a csv file called `wind_mph.csv`

In [60]:
new_wind_data.to_csv("wind_mph.csv",index=False) # check what happens when we do not include index=False
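
To compare the two settings of `index` without writing any files, the sketch below calls `to_csv` with no file name, in which case it returns the CSV text as a string. The two-row table is a small stand-in made up from the first two sample rows, for illustration only.

```python
import pandas as pd

# Small stand-in table used only for illustration
small_table = pd.DataFrame({
    "Time (minutes)": [0, 10],
    "Wind Speed (mph)": [24.4, 26.2],
})

# With no file name, to_csv returns the CSV text instead of writing a file
with_index = small_table.to_csv()
without_index = small_table.to_csv(index=False)

print(with_index)
print(without_index)
```

With the default setting, the row labels (0, 1, ...) are written as an extra, unnamed first column, which then shows up as an additional column if the file is read back in.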

### Exercise

Write code to plot the power generated against wind speed. See if you spot a likely relationship and if you can think of a model $y=f(x)$ that may fit the data. Try and fit the model and then plot both together.

In [None]:
#write code here:

### Bonus question

How big is this wind turbine?