In [2]:
# SETUP CODE - PLEASE RUN THIS ONCE WHEN YOU START UP YOUR CODESPACE

# RUN TEST FILE
%run test/week3_test.ipynb

# **Week 3 - NumPy, Pandas and DataFrames**

# NumPy & Mathematics in Python

NumPy, short for Numerical Python, is an essential library for scientific computing in Python. It provides support for large multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.

- Efficiency: NumPy arrays allow for faster operations than Python lists, especially for large data sets.
- Functionality: NumPy provides a wide range of mathematical and statistical functions.
- Convenience: It supports an array-oriented programming style, which simplifies many kinds of data manipulation tasks.
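
As a small sketch of the array-oriented style described above, the snippet below doubles every value once with a plain Python loop and once with a single NumPy expression. The variable names (`values`, `doubled_list`, `doubled_arr`) are illustrative only.

```python
import numpy as np

values = [1.5, 2.0, 3.5, 4.0]

# Plain Python: visit the list element by element
doubled_list = [v * 2 for v in values]

# NumPy: express the same operation on the whole array at once
doubled_arr = np.array(values) * 2

print(doubled_list)
print(doubled_arr)
```

Both approaches give the same numbers; the NumPy version avoids the explicit loop, which is where the speed-up on large arrays comes from.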




### NumPy Arrays
The core of NumPy is the ndarray object, representing a multidimensional, homogeneous array of fixed-size items.
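
To illustrate what "homogeneous" means in practice, here is a minimal sketch: when a list mixes integers and floats, NumPy picks one data type that can hold every element. The names `ints` and `mixed` are made up for this example.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])  # the integers are upcast to floats

# Every array carries a single dtype, a shape, and a number of dimensions
print(ints.dtype, ints.shape, ints.ndim)
print(mixed.dtype, mixed)
```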



In [9]:
import numpy as np

In [2]:
# From a Python list
arr = np.array([1, 2, 3])

# Multidimensional array
multi_arr = np.array([[1, 2, 3], [4, 5, 6]])

print(multi_arr)

# Arrays of zeros and ones
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))



[[1 2 3]
 [4 5 6]]


### Array Operations
NumPy provides a variety of operations that can be performed on arrays.

In [3]:
# Element-wise addition
sum_arr = arr + arr

# Element-wise multiplication
prod_arr = arr * arr

# Matrix product
matrix_prod = np.dot(multi_arr, ones)
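
The difference between element-wise multiplication and the matrix product is easiest to see from the shapes involved. The sketch below rebuilds the two arrays under the illustrative names `a` and `b`, and uses the `@` operator, which for 2D arrays gives the same result as `np.dot`.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
b = np.ones((3, 2))                   # shape (3, 2)

# Element-wise: both operands have the same shape, result is (2, 3)
elementwise = a * a

# Matrix product: (2, 3) @ (3, 2) -> (2, 2)
matrix_prod = a @ b

print(elementwise.shape, matrix_prod.shape)
print(matrix_prod)
```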


#### Array Indexing and Slicing
NumPy arrays can be indexed and sliced much like Python lists, with one index or slice per dimension.

In [4]:
# Accessing elements
element = arr[0]

# Slicing
sub_array = multi_arr[0:2, 1:3]
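
A short sketch of what these slices select, using the same `multi_arr` as above. `last_row` and `first_col` are extra illustrative names.

```python
import numpy as np

multi_arr = np.array([[1, 2, 3], [4, 5, 6]])

# Rows 0 and 1, columns 1 and 2 (the end index is excluded)
sub_array = multi_arr[0:2, 1:3]

# Negative indices count from the end; ':' selects everything on that axis
last_row = multi_arr[-1]
first_col = multi_arr[:, 0]

print(sub_array)
print(last_row, first_col)
```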


#### Useful NumPy Functions
NumPy provides many functions that are useful for statistical and mathematical operations.

In [5]:
# Mean and standard deviation
mean = np.mean(arr)
std_dev = np.std(arr)

# Sum and product
arr_sum = np.sum(arr)
arr_prod = np.prod(arr)

# Transpose
transposed = multi_arr.T

# trigonometric functions
np.sin(np.pi / 2)
np.cos(np.pi / 2) # <-- Observe the floating-point error: 6.123233995736766e-17 instead of 0

np.float64(6.123233995736766e-17)
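
Because of this kind of rounding, comparing floats with `==` is usually unreliable. One common approach, sketched here, is to compare within a tolerance using `np.isclose`.

```python
import numpy as np

value = np.cos(np.pi / 2)

print(value == 0.0)            # False: the result is tiny but not exactly zero
print(np.isclose(value, 0.0))  # True: equal within the default tolerance
```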

# **Challenge Question 1**
Catastrophic cancellation is a problem in numerical computing where significant digits of precision are lost due to the subtraction of two nearly equal numbers. This loss of precision can lead to highly inaccurate results, especially in floating-point computations.

### Example of Catastrophic Cancellation in Python
Let's consider a mathematical problem where catastrophic cancellation can occur. Suppose we want to compute the value of the expression sqrt(x + 1) - sqrt(x) for a very large x. Theoretically, as x becomes very large, this expression should approach zero. However, due to floating-point precision issues, the result can be inaccurate.

Here's how this can be implemented and demonstrated in Python:

In [None]:
def unstable_calculation(x):
    return np.sqrt(x + 1) - np.sqrt(x)
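
To see the problem in action, the sketch below evaluates the same function for increasingly large inputs. The chosen `x` values are arbitrary illustrations.

```python
import numpy as np

def unstable_calculation(x):
    return np.sqrt(x + 1) - np.sqrt(x)

# As x grows, fewer and fewer correct digits survive the subtraction
for x in [1e4, 1e8, 1e12, 1e16]:
    print(f"x = {x:.0e}: {unstable_calculation(x)!r}")
```

At `x = 1e16` the spacing between neighbouring 64-bit floats is 2, so `x + 1` rounds back to `x` and the function returns exactly `0.0`, even though the true value is small but positive.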


### Explanation and Alternative Approach
When subtracting two nearly equal numbers, many of the leading digits cancel out, and the difference is determined by the less significant digits, which are less accurately known. This leads to a result that can be significantly off from the true value.

To avoid catastrophic cancellation, one approach is to reformulate the problem to avoid direct subtraction of nearly equal numbers. Using algebraic manipulation or an alternative formula that is mathematically equivalent but numerically more stable can often help.

For the example above, we can use the mathematical identity:

$$
\sqrt{a} - \sqrt{b} = \frac{(\sqrt{a} - \sqrt{b})(\sqrt{a} + \sqrt{b})}{\sqrt{a} + \sqrt{b}} = \frac{a - b}{\sqrt{a} + \sqrt{b}}
$$

Applying this identity, we can rewrite the function in a more stable form:

$$
\frac{(x + 1) - (x)}{\sqrt{x+1} + \sqrt{x}} = \frac{1}{\sqrt{x+1} + \sqrt{x}}
$$

In [None]:
def stable_calculation(x):
    # implement your stable function code here
    return 1


In [None]:
# run tests to confirm your stable calculation function
test_stable_calculation()

In [None]:
# graph of stable vs unstable
import matplotlib.pyplot as plt

# Generate a range of x values
x_values = np.linspace(1, 1e12, 10000)

# Calculate the absolute difference between stable and unstable calculations
difference = [abs(stable_calculation(x) - unstable_calculation(x)) for x in x_values]

# Plotting the results
plt.figure(figsize=(10, 6))
plt.plot(x_values, difference, label="Absolute Difference", color='blue')
plt.xscale('log')  # Using logarithmic scale for x-axis
plt.yscale('log')  # Using logarithmic scale for y-axis
plt.xlabel('x value')
plt.ylabel('Absolute Difference')
plt.title('Difference between Stable and Unstable Calculations')
plt.legend()
plt.show()


This example shows that computational mathematics is not always a lift-and-shift operation (i.e. it's not as simple as picking up the equation and typing it into Python). Considerations such as numerical stability come into play, as seen in this Challenge Question.

# **Challenge Question 2**
Implement the following set of functions using NumPy:

1. **create_array(x):**
    - Takes a single integer `x` as input.
    - Returns a NumPy 2D array of shape `(x, x)` filled with random integer values between 1 and 100 (inclusive).
    
2. **slice_array(arr, slice_size):**
    - Takes as input a 2D NumPy array `arr` (created by `create_array`) and an integer `slice_size`.
    - Returns the central region of `arr` with shape `(slice_size, slice_size)`.
    - If there are any impossible computations then return `None`. For example, if `arr` isn't square, or if the user tries to find an even-sized central region inside an odd-sized array. 

3. **column_sum(arr):**
    - Takes a 2D NumPy array `arr` as input.
    - Returns a 1D NumPy array containing the sum of each column in `arr`.

4. **normalise(arr):**
    - Takes a 2D NumPy array `arr` as input.
    - Returns a new array where each value is normalised to the range `[0, 1]` using min-max normalisation. Hint: Look up how to calculate min-max normalisation.
    - If the min and max are the same, return an array of ones with the same dimensions as the input.


In [3]:
def create_array(x: int) -> np.ndarray:
    pass

def slice_array(arr: np.ndarray, slice_size: int) -> np.ndarray:
    pass

def column_sum(arr: np.ndarray) -> np.ndarray:
    pass

def normalise(arr: np.ndarray) -> np.ndarray:
    pass

In [4]:
test_numpy_challenge()

Test 1 Failed/Skipped: Create 5x5 array. create_array function is not implemented
Test 2 Failed/Skipped: Create 1x1 array. create_array function is not implemented
Test 3 Failed/Skipped: Central 3x3 slice from 5x5 array (odd/odd). slice_array returned None but a valid slice was expected
Test 4 Failed/Skipped: Central 2x2 slice from 4x4 array (even/even). slice_array returned None but a valid slice was expected
Test 7 Passed: Odd array (5x5), even slice_size=2 -> None
Test 8 Passed: Even array (4x4), odd slice_size=3 -> None
Test 9 Passed: slice_size larger than array (5x5, slice=6) -> None
Test 10 Passed: Non-square array (3x4) -> None
Test 11 Failed/Skipped: Slice equal to array size (6x6, slice=6). slice_array returned None but a valid slice was expected
Test 12 Failed/Skipped: Central 4x4 from 8x8. slice_array returned None but a valid slice was expected
Test 13 Failed/Skipped: Central 5x5

TypeError: 'NoneType' object is not subscriptable

## **NumPy Cheat Sheet**
![numpy-cheat-sheet](../resources/images/week3/numpy_cheat_sheet.jpg)

# **Pandas DataFrames**

Pandas is a popular Python library for data manipulation and analysis. Its primary data structure is the DataFrame, a 2-dimensional labeled structure, ideal for handling various data types and complex data operations.

### Key Features of Pandas DataFrames
- Handling Different Data Types: Supports columns with diverse data types.
- Size Mutability: Easy addition and deletion of columns.
- Labeling Data: Clear labelling of rows and columns.
- Advanced Data Operations: Offers a wide range of functions for data manipulation, including filtering, grouping, and pivoting.

### Creating DataFrames
DataFrames can be created from different data structures such as dictionaries, lists, or NumPy arrays. We can think of DataFrames as tables or Excel sheets with rows and columns.


In [10]:
# import pandas
import pandas as pd

In [17]:
# from a dictionary
data = {'Name': ['Alice', 'Bob', 'Chris', 'John'], 'Age': [25, 30, 35, 48]}
dict_df = pd.DataFrame(data)

# from a list
names = ['Alice', 'Bob', 'Chris', 'John']
ages = [25, 30, 35, 48]
list_df = pd.DataFrame(list(zip(names, ages)), columns=['Name', 'Age'])

print('Dictionary Dataframe\n', dict_df, '\n\n')
print('List Dataframe\n', list_df)

df = dict_df

Dictionary Dataframe
     Name  Age
0  Alice   25
1    Bob   30
2  Chris   35
3   John   48 


List Dataframe
     Name  Age
0  Alice   25
1    Bob   30
2  Chris   35
3   John   48
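
A DataFrame can also be built from a NumPy array. In this minimal sketch the column names `a`, `b`, `c` and the variable `array_df` are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2, 3], [4, 5, 6]])

# Each row of the 2D array becomes a row; column labels are supplied separately
array_df = pd.DataFrame(values, columns=['a', 'b', 'c'])

print(array_df)
```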


### Reading and Writing Data
Pandas supports various file formats like CSV, Excel, and JSON.

In [None]:
# read csv
# df = pd.read_csv('data.csv') <- we don't want to run this code since we don't have a file called data.csv in this directory

# write to csv
# df.to_csv('output.csv', index=False) 
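
To try reading and writing without creating any files, one option is to round-trip through an in-memory string. This is only a sketch: `csv_text` and `round_trip` are illustrative names, and the tiny DataFrame stands in for real data.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path given, to_csv returns the CSV content as a string
csv_text = df.to_csv(index=False)

# read_csv accepts any file-like object, including an in-memory buffer
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(round_trip)
```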

### Viewing Data

In [8]:
df.head()  # First 5 rows
df.tail()  # Last 5 rows

    Name  Age
0  Alice   25
1    Bob   30
2  Chris   35


### Selecting Data with loc and iloc
- loc: Selects data by labels/index.
- iloc: Selects data by integer position.

In [18]:
# Selecting a single row by index
row = df.loc[0]

# Selecting a range of rows
rows = df.iloc[1:3]

# Selecting specific columns
ages = df.loc[:, 'Age']
name_age = df.iloc[:, [0, 1]]
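
With the default integer index, `loc` and `iloc` can look interchangeable. The difference shows up once the index holds labels, as in this sketch (the lowercase name labels are invented for the example).

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35, 48]},
    index=['alice', 'bob', 'chris', 'john'],
)

by_label = df.loc['bob', 'Age']   # look up by row label and column label
by_position = df.iloc[1, 0]       # look up by row position and column position

label_slice = df.loc['bob':'chris']  # label slices include both end points
position_slice = df.iloc[1:2]        # position slices exclude the end point

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```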

### Filtering Data

In [19]:
over_30 = df[df['Age'] > 30]

print(over_30)

    Name  Age
2  Chris   35
3   John   48
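
Filters can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A small sketch on the same sample data, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Chris', 'John'],
    'Age': [25, 30, 35, 48],
})

# Rows where both conditions hold
thirties = df[(df['Age'] >= 30) & (df['Age'] < 40)]

# Rows where at least one condition holds
under_30_or_john = df[(df['Age'] < 30) | (df['Name'] == 'John')]

print(thirties)
print(under_30_or_john)
```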


### Adding and Deleting Columns

In [20]:
df['Country'] = 'USA'  # Add new column

print(df, '\n')

del df['Country']      # Delete column

print(df)


    Name  Age Country
0  Alice   25     USA
1    Bob   30     USA
2  Chris   35     USA
3   John   48     USA 

    Name  Age
0  Alice   25
1    Bob   30
2  Chris   35
3   John   48


### Advanced DataFrame Features

#### Groupby

In [23]:
# construct complementary df
employee_data = {
    'Name': ['Alice', 'Bob', 'Chris', 'John'],
    'Department': ['Finance', 'IT', 'IT','Finance'],
    'Salary': [70000, 65000, 80000, 50000],
    'City': ['Brisbane', 'Brisbane', 'Cairns', 'Cairns']
    }

employee_df = pd.DataFrame(employee_data)

# group by 
grouped = employee_df.groupby('City')

print('Grouped DataFrame\n')

# Iterate through groups
for name, group in grouped:
    print(f"City: {name}")
    print(group, "\n")

Grouped DataFrame

City: Brisbane
    Name Department  Salary      City
0  Alice    Finance   70000  Brisbane
1    Bob         IT   65000  Brisbane 

City: Cairns
    Name Department  Salary    City
2  Chris         IT   80000  Cairns
3   John    Finance   50000  Cairns 



### Merging and Joining
To be discussed in more detail next week

In [24]:


# merging and joining
merged_df = pd.merge(df, employee_df, on='Name')

print('\n\nMerged Dataframe\n', merged_df)




Merged Dataframe
     Name  Age Department  Salary      City
0  Alice   25    Finance   70000  Brisbane
1    Bob   30         IT   65000  Brisbane
2  Chris   35         IT   80000    Cairns
3   John   48    Finance   50000    Cairns
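
In the merge above every name appears in both DataFrames, so no rows are lost. As a preview, the sketch below uses two small made-up tables (`people` and `jobs`, with a hypothetical extra person "Dana") to show how the `how` argument decides what happens to unmatched rows.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Dana'], 'Age': [25, 30, 41]})
jobs = pd.DataFrame({'Name': ['Alice', 'Bob', 'Chris'],
                     'Department': ['Finance', 'IT', 'IT']})

# Default how='inner': keep only names present in both tables
inner = pd.merge(people, jobs, on='Name')

# how='left': keep every row of the left table, filling gaps with NaN
left = pd.merge(people, jobs, on='Name', how='left')

print(inner)
print(left)
```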


### Aggregate Data

In [25]:
# Aggregate Data on Grouping
aggr_df = merged_df.groupby("City").agg({'Salary':['min','mean'], 'Age':'mean'})
print(aggr_df)

         Salary            Age
            min     mean  mean
City                          
Brisbane  65000  67500.0  27.5
Cairns    50000  65000.0  41.5
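
The dictionary form of `agg` produces two-level column headers. If flat column names are easier to work with, pandas also supports named aggregation, sketched below. The output column names (`min_salary`, `mean_salary`, `mean_age`) are chosen for this example, and the data is rebuilt so the snippet runs on its own.

```python
import pandas as pd

merged_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Chris', 'John'],
    'Age': [25, 30, 35, 48],
    'Salary': [70000, 65000, 80000, 50000],
    'City': ['Brisbane', 'Brisbane', 'Cairns', 'Cairns'],
})

# new_column_name=(source_column, aggregation_function)
summary = merged_df.groupby('City').agg(
    min_salary=('Salary', 'min'),
    mean_salary=('Salary', 'mean'),
    mean_age=('Age', 'mean'),
)

print(summary)
```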


### Pivot

![pandas-pivot](../resources/images/week3/pandas_pivot.png)

In [1]:
# pivot tables
pivot = merged_df.pivot_table(values=['Age', 'Salary'], index='Department', aggfunc='mean')

print('\nPivoted Dataframe\n', pivot)

NameError: name 'merged_df' is not defined
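
The pivot cell relies on `merged_df` having been created by the merge cell earlier in the notebook. The sketch below rebuilds the same rows directly so the pivot can be run on its own.

```python
import pandas as pd

merged_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Chris', 'John'],
    'Age': [25, 30, 35, 48],
    'Department': ['Finance', 'IT', 'IT', 'Finance'],
    'Salary': [70000, 65000, 80000, 50000],
    'City': ['Brisbane', 'Brisbane', 'Cairns', 'Cairns'],
})

# One row per department, one column per value, each cell holding the mean
pivot = merged_df.pivot_table(values=['Age', 'Salary'], index='Department', aggfunc='mean')

print(pivot)
```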

These operations are used by data engineers to merge, transform and mould data into various shapes for different purposes. The concepts are the same across Python and SQL (Structured Query Language), so the skills are very transferable.

# **Pandas Data Wrangling Cheat Sheet**
![pandas-data-wrangling-1](../resources/images/week3/pandas_datawrangling_1.PNG)
![pandas-data-wrangling-2](../resources/images/week3/pandas_datawrangling_2.PNG)

# **Challenge Question 3**

In this Challenge Question, you are provided with a scenario that involves two sets of data represented as Pandas DataFrames in Python. The first DataFrame, customers_df, contains customer data for a hypothetical energy company, while the second DataFrame energy_usage_df contains each customer's hourly energy usage data. The goal is to merge and analyze these datasets to gain insights into customer energy usage patterns.

In [11]:
# first, let's set up some random data for our DataFrames

# Sample customer data for 10 customers
customer_data = {
    'NMI': [f'NMI{100 + i}' for i in range(10)],
    'Name': [f'Customer {i}' for i in range(1, 11)],
    'Address': [f'{i} Some St' for i in range(100, 110)],
    'Age': [20 + i for i in range(10)]
}

customers_df = pd.DataFrame(customer_data)

# set np random seed which makes sure that random data is always the same for testing purposes
np.random.seed(0)

# Generate sample energy usage data for 10 customers
hours = pd.date_range('2023-01-01', periods=24, freq='h')
nmis = [f'NMI{100 + i}' for i in range(10)]
energy_data = {
    'NMI': np.repeat(nmis, 24),
    'Hour': hours.tolist() * 10,
    'kWh': np.random.rand(24 * 10) * 10  # Random kWh values for each hour
}

energy_usage_df = pd.DataFrame(energy_data)


In [12]:
# let's list the customer data to see the rows and columns

customers_df.head()

      NMI        Name      Address  Age
0  NMI100  Customer 1  100 Some St   20
1  NMI101  Customer 2  101 Some St   21
2  NMI102  Customer 3  102 Some St   22
3  NMI103  Customer 4  103 Some St   23
4  NMI104  Customer 5  104 Some St   24


In [13]:
# let's list the energy usage data to see the rows and columns
energy_usage_df.head()

      NMI                 Hour       kWh
0  NMI100  2023-01-01 00:00:00  5.488135
1  NMI100  2023-01-01 01:00:00  7.151894
2  NMI100  2023-01-01 02:00:00  6.027634
3  NMI100  2023-01-01 03:00:00  5.448832
4  NMI100  2023-01-01 04:00:00  4.236548


#### Explanation of the Customers Dataframe

The customers_df DataFrame contains information about customers. Each row in this DataFrame represents a unique customer, with details about their identity and demographic information. Here's what each column represents:

- NMI (National Metering Identifier): This is a unique identifier for each customer. It's a string that starts with 'NMI' followed by a number (e.g., 'NMI100', 'NMI101', etc.). This identifier is crucial for linking customers with their respective energy usage data.

- Name: This column contains the name of the customer. In this dataset, customers are named in a sequence (e.g., 'Customer 1', 'Customer 2', etc.), indicating their order or position in the dataset.

- Address: The address of each customer is listed here. Addresses are fictional and follow a numerical sequence (e.g., '100 Some St', '101 Some St', etc.). They provide a location context for each customer.

- Age: This column shows the age of each customer. Ages are numeric values starting from 20 and increasing sequentially by 1 for each customer (e.g., 20, 21, 22, etc.).


#### Explanation of the Energy Usage Dataframe

The energy_usage_df DataFrame contains energy usage data for each customer, detailed hour by hour. Each row in this DataFrame represents an hourly record of energy consumption for a customer. Here's the breakdown:

- NMI: Just like in customers_df, this column contains the National Metering Identifier for each customer. It's used to link each energy usage record to the corresponding customer in customers_df.

- Hour: This column contains datetime objects, each representing a specific hour of a day. For instance, if the date is '2023-01-01', the hourly breakdown will start from '2023-01-01 00:00:00' and go up to '2023-01-01 23:00:00', covering a full 24-hour period.

- kWh (Kilowatt-hour): This column shows the amount of energy consumed during each specified hour. The values are numeric and represent the energy usage in kilowatt-hours. These values are randomly generated in the example, ranging between 0 and 10 kWh.
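
To make the layout of `energy_usage_df` concrete, here is a scaled-down sketch with 2 customers and 3 hours, built with the same `np.repeat` and list-repetition pattern as the setup cell. The name `toy` is illustrative, and the `kWh` column is left out for brevity.

```python
import numpy as np
import pandas as pd

hours = pd.date_range('2023-01-01', periods=3, freq='h')
nmis = ['NMI100', 'NMI101']

toy = pd.DataFrame({
    # Each NMI is repeated once per hour: NMI100 x3, then NMI101 x3
    'NMI': np.repeat(nmis, len(hours)),
    # The full list of hours is repeated once per NMI
    'Hour': hours.tolist() * len(nmis),
})

print(toy)
```

The result has one row per customer per hour, which is why the full dataset has 10 × 24 = 240 rows.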

### Tasks

Note: questions marked with ** may be slightly more difficult

#### Return the Data for One Customer **
For the `energy_usage_df`, return the `kWh` values as a list between 10:00am and 1:00pm inclusive for NMI101. Store the results in a variable called `nmi101_filtered`. Hint: Look up .tolist()

#### Merge the Dataframes
Merge `customers_df` with `energy_usage_df` on the 'NMI' column and store it in a variable called `merged_df`. What does the merged DataFrame look like?

In [None]:
# create a variable called merged_df which holds the merged dataframe


#### Calculate Total Energy Usage
Calculate the total energy usage (kWh) for each customer in the merged DataFrame and assign it to the variable `total_energy_per_customer`

Hint: when you group by a column you generally need to use an aggregation function like sum() or mean()

example: `customer_salary = df.groupby('Name')['Salary'].sum()`

In [None]:
# create a variable called total_energy_per_customer
# HINT: Use "NMI", not "Name"


#### Who used the most Energy over the 24 Hour period

Using your newly created variable `total_energy_per_customer`, find the name of the customer who used the most energy and assign it to the variable `highest_energy_usage_customer_name`.

In [None]:
# create a variable called highest_energy_usage_customer_name and assign the name of the customer
# who had the highest energy usage over the 24 hours


#### Average Energy Usage for Each hour of the day across all customers **

Calculate the average energy usage for each hour of the day across all customers. Assign it to the variable `average_energy_per_hour`.

In [None]:
# create a variable called average_energy_per_hour which stores the average energy usage across all customers for each hour of
# the day


#### Calculate which hour had the highest usage
Using your newly created `average_energy_per_hour` dataframe calculate which hour had the highest usage across all customers for the day. Assign it to the variable `highest_usage_hour`

In [None]:
# create a variable called highest_usage_hour which stores the hour which had the highest usage of electricity across
# the entire day


#### Calculate the Age-Energy Correlation Value **
Calculate the correlation value between the age of the customer and the amount of energy usage and assign it to the variable `age_energy_correlation`. If we consider that a perfect correlation between two variables would be a value of `1`, do you think the value is significant enough to infer a causal link between the two variables? That is, can we accurately predict energy usage from age?
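
For a feel of what correlation values mean, the sketch below uses made-up numbers that are unrelated to the energy data. `Series.corr` computes the Pearson correlation by default, which ranges from `-1` (perfectly linear, decreasing) to `1` (perfectly linear, increasing), with values near `0` indicating little linear relationship.

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

rising = x.corr(2 * x + 1)  # y increases in a straight line with x
falling = x.corr(-x)        # y decreases in a straight line with x

print(rising, falling)
```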


In [None]:
# create a variable called age_energy_correlation and calculate the correlation between a customer's age and the energy they use.
# HINT: use the merged_df table. You should return a single float value


In [28]:
# if you're a data wizard run some automated tests on your variables
test_energy_analysis_tasks()

Test 1 Error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Test 2 Failed: Testing merging of DataFrames. merged_df is not defined
Test 3 Failed: Testing total energy usage calculation. total_energy_per_customer is not defined
Test 4 Failed: Testing highest energy usage customer. highest_energy_usage_customer_name is not defined
Test 5 Failed: Testing average energy usage per hour. average_energy_per_hour is not defined
Test 6 Failed: Testing highest usage hour. highest_usage_hour is not defined
Test 7 Failed: Testing age-energy correlation. age_energy_correlation is not defined


This Challenge Question is a real-world EQL example of how pandas helps solve our very complicated business problems. The difference is that we are currently scaling up to 2.5 million customers and reading their meters every 5 minutes. That's 720 million meter readings stored in our databases every day (2.5 million meters × 288 five-minute intervals per day). We then calculate the price for each customer and send it off to the retailers to charge the customer. We estimate that we will store more data in the next two years than we have since Energex and Ergon came into existence. Now that's BIGGGGG DATA!!!