In [None]:
import warnings
warnings.filterwarnings("ignore")

## 🧠 Let's Check In!

## Go to 👉 [www.menti.com](https://www.menti.com)  
## and enter the code: **6247 1541**
## 
## > 💬 Answer the question on screen — your response will appear live!

In [None]:
import pandas as pd
# Load data into pandas DataFrame from "/lakehouse/default/Files/HR_file.csv"
df = pd.read_csv("/lakehouse/default/Files/HR_file.csv")
display(df)


## 🔢 NumPy: The Powerhouse of Numerical Computing

###### NumPy is a powerful library for **efficient numerical computing in Python**.  
###### It provides **fast mathematical operations**, **multi-dimensional arrays**, and **integration with other data tools**.
###### 
###### ✅ Why Use NumPy?
###### - 🚀 **Efficient Array Handling** – Much faster than Python lists
###### - 📏 **Multi-dimensional Array Support** – Supports matrices, tensors, and more
###### - 🔢 **Slicing & Indexing** – Similar to lists but more powerful
###### - 🤝 **Integrates with**:
- ###### **Pandas** – Forms the backbone of DataFrames & Series
- ###### **Matplotlib** – Used for visualization in Python
- ###### **Scikit-learn** – Essential for Machine Learning
###### 
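###### ⏳ Performance Comparison
###### - ✅ **Python List Time:** Each element is processed one at a time by the interpreter, which is usually much slower on large data  
###### - 🚀 **NumPy Array Time:** The whole array is processed in compiled code, optimized for fast calculations  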
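
The Pandas integration listed above can be seen directly. This is a minimal sketch with made-up salary values (they are not taken from `HR_file.csv`); it shows that the values of a Series can be handed to NumPy as an array.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, used only for illustration
salaries = pd.Series([30000, 42000, 55000], name="salary")

# .to_numpy() exposes the values of the Series as a NumPy array
salary_array = salaries.to_numpy()
print(type(salary_array))

# NumPy functions also accept the Series directly
print(np.mean(salaries))
```
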
###### 

##### 🔢 1. Multi-Dimensional Arrays in NumPy
###### ➡️ NumPy makes it easy to work with 2D and 3D arrays — just like Excel sheets or image data.

In [None]:
import numpy as np

# Create a 2D array (like a matrix)
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

print("📐 2D Array:\n", matrix)

# Access the first row
print("First row:", matrix[0])

# Access the item at row 2, column 3 (index [1, 2])
print("Item at (2, 3):", matrix[1, 2])

# Slice a sub-matrix (rows 1–2, columns 1–2). Note: the slice start is inclusive and the end is exclusive
print("Sub-matrix:\n", matrix[0:2, 0:2])
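
The same indexing ideas extend to 3D arrays. The sketch below uses an invented array of 24 numbers arranged as 2 "sheets" of 3 rows and 4 columns; the name `cube` and the values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: 2 "sheets", each with 3 rows and 4 columns
cube = np.arange(24).reshape(2, 3, 4)

print("Shape:", cube.shape)        # (sheets, rows, columns)
print("Dimensions:", cube.ndim)

# Select the whole first sheet (a 2D array)
print("First sheet:\n", cube[0])

# Select one item: sheet 2, row 1, column 3 (index [1, 0, 2])
print("Single item:", cube[1, 0, 2])
```
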


##### 📊 2. Simple Broadcasting (Scalar + Array)

In [None]:
print("📐 matrix:\n", matrix)

# Add 10 to every element (scalar broadcasting)
print("Add 10:\n", matrix + 10)

# Multiply all values by 2
print("Double values:\n", matrix * 2)

##### 🔄 3. Row-wise Broadcasting

In [None]:
print("📐 matrix:\n", matrix)
# Add a 1D array to each row (broadcasts across rows)
row_add = np.array([1, 10, -1])  # one value per column, applied to every row
print("Row-wise add:\n", matrix + row_add)

##### 🔁 4. Column-Wise Broadcasting and `.reshape()`

If you want to add values **down the columns** of a 2D matrix (i.e., column-wise),
you need a column vector. You can create this by reshaping a 1D array.

This ensures the shape lines up for broadcasting:
- ✅ Shape of matrix: `(rows, columns)`
- ✅ Shape of column vector: `(rows, 1)` – one value per row
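
To see how the two shapes behave, here is a minimal sketch that adds the same three values first as a 1D array and then as a column vector. The variable names are chosen for illustration; `reshape(-1, 1)` asks NumPy to work out the number of rows itself.

```python
import numpy as np

matrix = np.arange(1, 10).reshape(3, 3)   # the values 1 to 9 in a 3x3 grid
values = np.array([10, 20, 30])

# Shape (3,) lines up with the columns: each column gets its own value
row_result = matrix + values

# Shape (3, 1) lines up with the rows: each row gets its own value
col_result = matrix + values.reshape(3, 1)

print("1D array added:\n", row_result)
print("Column vector added:\n", col_result)

# -1 means "work this dimension out for me"
print(values.reshape(-1, 1).shape)
```
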


In [None]:
print("📐 Matrix:\n", matrix)

# Create a column vector with one value per row
col_add = np.array([10, 20, 30]).reshape(3, 1)

print(col_add)  # similar to transposing a row into a column in Excel

# Broadcast column vector across each row
print("Column-wise add:\n", matrix + col_add)

##### 📈 5. Visualise a NumPy Array with Matplotlib
###### ➡️ Turn your array into a quick line chart

In [None]:
import matplotlib.pyplot as plt

# Create a simple NumPy array
data = np.array([2, 4, 6, 8, 10, 12])

# Plot the array
plt.plot(data)
plt.title("📊 Simple Line Plot from NumPy")
plt.xlabel("Index")
plt.ylabel("Value")
#plt.grid(True)
#plt.show()


In [None]:
import time
import numpy as np

# Generate a large list of numbers
numbers = list(range(1000000))

# Time a Python list comprehension (an element-by-element loop)
start = time.perf_counter()
doubled_list = [n * 2 for n in numbers]
end = time.perf_counter()
print(f"⏳ Python list time: {end - start:.9f} seconds")

# Time the equivalent NumPy operation (creating the array is not included in the timing)
np_array = np.array(numbers)
start = time.perf_counter()
doubled_np = np_array * 2
end = time.perf_counter()
print(f"⚡ NumPy time: {end - start:.9f} seconds")


### ✅ Why This Matters
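###### List comprehensions are fast by Python standards, but they still process one element at a time in the interpreter.
###### 
###### NumPy is vectorised: the whole operation runs in compiled C under the hood, which is usually much faster for large arrays.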
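
A single timing can be noisy, so one option is to repeat the measurement with the standard library's `timeit` module. This is a sketch only: the array size and the number of repeats are arbitrary choices, and the actual timings will vary from machine to machine.

```python
import timeit
import numpy as np

numbers = list(range(100_000))
np_array = np.array(numbers)

# Run each approach 20 times and report the total time
list_time = timeit.timeit(lambda: [n * 2 for n in numbers], number=20)
numpy_time = timeit.timeit(lambda: np_array * 2, number=20)

print(f"List comprehension: {list_time:.4f} s for 20 runs")
print(f"NumPy:              {numpy_time:.4f} s for 20 runs")

# Both approaches produce the same values
same_values = np.array_equal(np_array * 2, [n * 2 for n in numbers])
print("Same result:", same_values)
```
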

# 🧪 Exercise Time!

Now it’s your turn to practise! Download the NumPy Exercises from the GitHub site if you haven't done so already.

✅ Try filling in the missing parts in the code cells below  
✅ Don’t worry if you get stuck – ask questions, test things out  
✅ Use the comments and examples as guidance  

Remember: **practice builds confidence** 💪


# 🐼 Pandas: The Essential Library for Data Analysis

###### **Pandas** is a powerful Python library for **high-performance data manipulation and analysis**.  
###### It simplifies working with **structured data**, making it an essential tool for data scientists and analysts.
### 
<br><br><br>
### ✅ Key Features of Pandas
###### - 📊 **DataFrame Manipulation** – Intuitive handling of tabular data with powerful indexing.
###### - ⏳ **Time Series Analysis** – Advanced tools for working with time-stamped data.
###### - 🛠 **Data Cleaning & Preparation** – Easily handle missing values, transformations, and preprocessing.
###### - 📂 **File Format Compatibility** – Import/export data from CSV, Excel, SQL, and more.
###### - 🔗 **Merging & Joining** – Combine datasets efficiently using smart indexing.
<br><br><br>
### 🔥 Why Use Pandas?
###### - **Efficient**: Optimized for speed and memory usage.
###### - **Flexible**: Works with many file formats and integrates with NumPy & Matplotlib.
###### - **Easy to Learn**: Simple, yet powerful syntax.
###### 
###### 🚀 Pandas **forms the backbone of modern data science**, providing an easy-to-use interface for **cleaning, transforming, and analyzing data**.


### 🔹 Attribute
###### An attribute is a value or property of an object. You access it without parentheses. Think of it as stored information, e.g. `df.columns`, `df.shape`

### 🔹 Method
###### A method is like a function, but it's tied to a specific object. You call it with parentheses, and it usually acts on or with that object, e.g. `df.head()`, `df.describe()`
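
The difference is easy to check in code. The sketch below uses a small made-up DataFrame (`df_demo` is a name chosen for this example) and Python's built-in `callable()` to show which names can be called.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df_demo = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Attribute: stored information, no parentheses
print(df_demo.shape)

# Method: called with parentheses, and it can take arguments
print(df_demo.head(2))

# A method is callable; an attribute such as shape is just a value
print(callable(df_demo.head))
print(callable(df_demo.shape))
```
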

## 🧰 Exploring Your DataFrame in Python
###### When working with data, it's important to understand what your DataFrame contains. Here are some common and useful methods and attributes for exploring a pandas DataFrame:
###### 
###### .head() – Shows the first few rows (default is 5)
###### 
###### .tail() – Shows the last few rows
###### 
###### .columns – Lists all column names (an attribute, not a method)
###### 
###### .dtypes – Shows the data type of each column
###### 
###### .describe() – Gives summary statistics for numeric columns
###### 
###### .isnull().sum() – Counts missing values in each column
###### 
###### These are foundational tools for any data analyst or beginner working with pandas.

In [None]:
import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary
data = {
    'Temperature (C)': [20, 22, 24, 26, 28],
    'Humidity (%)': [30, 35, 40, 45, 50]
}

# This creates a DataFrame with two columns; each column is backed by a NumPy array under the hood
dfWeather = pd.DataFrame(data)
print(dfWeather)
type(dfWeather)

In [None]:
print(f"First five rows of the data:\n{dfWeather.head()}\n")
print(f"Last five rows of the data:\n{dfWeather.tail()}\n")
print(f"Column names:\n{dfWeather.columns}\n")
print(f"Data types of each column:\n{dfWeather.dtypes}\n")
print(f"Summary statistics:\n{dfWeather.describe()}\n")
print(f"Number of missing values per column:\n{dfWeather.isnull().sum()}\n")

In [None]:
# Create a date range
dates = pd.date_range(start="2025-03-01", periods=5, freq="D")


# Set the date index
dfWeather.index = dates
print(dfWeather)

print(dfWeather.index)
print(type(dfWeather.index))


### 📅 Date as Index – What and Why?

When you set dates as the index of a DataFrame (for example, the output of `pd.date_range`), Pandas stores them as a **`DatetimeIndex`**.

#### ✅ What is a `DatetimeIndex`?
- It’s a special type of index in Pandas designed for time-based data.
- It allows smart slicing, filtering, and resampling using dates.

#### 🔍 Why it’s useful:
- You can easily select rows by date:
  ```python
  df.loc['2025-03-03']
  ```


In [None]:
print(dfWeather.loc['2025-03'])


###### You can resample data by month, week, day, etc.:

In [None]:
# 'M' groups the rows by calendar month; pandas 2.2 and later prefer the alias 'ME'
dfWeather.resample('M').mean()
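
With only five days of data, a monthly resample returns a single row. To see several groups, here is a sketch with 14 days of invented readings (the values 0 to 13) resampled into 7-day blocks; the name `readings` and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 14 daily readings with the values 0..13
idx = pd.date_range(start="2025-03-01", periods=14, freq="D")
readings = pd.DataFrame({"Temperature (C)": np.arange(14)}, index=idx)

# Group the rows into 7-day blocks and average each block
weekly = readings.resample("7D").mean()
print(weekly)
```

Each row of `weekly` is labelled with the first day of its 7-day block.
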

Further reading on `.loc` vs `.iloc`: [DataCamp tutorial](https://app.datacamp.com/learn/tutorials/loc-vs-iloc)


##### 🔍 Using .iloc (integer position)

In [None]:
# Show the full DataFrame for reference
print(dfWeather)

In [None]:
# Row at integer position 2 (the third row)
print(dfWeather.iloc[2])
# Output:
# Temperature (C)    24
# Humidity (%)       40

In [None]:
# Element at row 1, column 0
print(dfWeather.iloc[1, 0])  # Output: 22

In [None]:
# First 3 rows, both columns
print(dfWeather.iloc[:3, :])

##### 🔍 Using .loc (label index)
###### 


In [None]:
print(dfWeather)

# Row with date '2025-03-03' 
print(dfWeather.loc['2025-03-03'])



In [None]:
# Element at date '2025-03-04', column 'Humidity (%)'
print(dfWeather.loc['2025-03-04', 'Humidity (%)'])

In [None]:
# Rows from '2025-03-02' to '2025-03-04', specific columns. NB: unlike .iloc, a .loc slice includes the end label
print(dfWeather.loc['2025-03-02':'2025-03-04', ['Temperature (C)', 'Humidity (%)']])
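
The difference in how the two indexers treat the end of a slice is worth checking side by side. This sketch rebuilds a small date-indexed DataFrame (`demo` is a name chosen for this example) so that it runs on its own.

```python
import pandas as pd

dates = pd.date_range(start="2025-03-01", periods=5, freq="D")
demo = pd.DataFrame({"Temperature (C)": [20, 22, 24, 26, 28]}, index=dates)

# .iloc slices by position: the end position is excluded
by_position = demo.iloc[1:3]

# .loc slices by label: the end label is included
by_label = demo.loc["2025-03-02":"2025-03-04"]

print("Rows from .iloc[1:3]:", len(by_position))
print("Rows from .loc['2025-03-02':'2025-03-04']:", len(by_label))
```
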



In [None]:
# Add a new column for temperature in Fahrenheit
# note the syntax to select a column
dfWeather['Temperature (F)'] = dfWeather['Temperature (C)'] * 9/5 + 32
print(dfWeather)


In [None]:
# Filter rows where Humidity is greater than 40%
# Note the syntax: the inner expression dfWeather['Humidity (%)'] > 40 builds a boolean mask (True/False per row),
# and the outer brackets apply that mask to the DataFrame, keeping only the rows where it is True

high_humidity = dfWeather[dfWeather['Humidity (%)'] > 40]
print(high_humidity)
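
Filters can also be combined. The sketch below rebuilds the weather values in a standalone DataFrame (named `weather` here for illustration) and joins two conditions with `&` (and) and `|` (or). Each condition needs its own parentheses.

```python
import pandas as pd

weather = pd.DataFrame({
    "Temperature (C)": [20, 22, 24, 26, 28],
    "Humidity (%)": [30, 35, 40, 45, 50],
})

# The mask on its own: one True/False value per row
mask = weather["Humidity (%)"] > 40
print(mask)

# Both conditions must be True (&)
warm_and_humid = weather[(weather["Temperature (C)"] > 24) & (weather["Humidity (%)"] >= 40)]
print(warm_and_humid)

# At least one condition must be True (|)
cool_or_humid = weather[(weather["Temperature (C)"] < 22) | (weather["Humidity (%)"] > 45)]
print(cool_or_humid)
```
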

## Let's put it all together with data - you will need the HR_file.csv

# 🧪 Exercise Time!

Now it’s your turn to practise! We are going to upload the HR_file.csv from the GitHub site.

✅ Try filling in the missing parts in the code cells below  
✅ Don’t worry if you get stuck – ask questions, test things out  
✅ Use the comments and examples as guidance  

Remember: **practice builds confidence** 💪


In [None]:
# Example: Sequential steps
print("Step 1: Importing data")

import pandas as pd
import numpy as np

# Load data into pandas DataFrame from "/lakehouse/default/" + "Files/HR_file.csv"
df = pd.read_csv("/lakehouse/default/" + "Files/HR_file.csv", delimiter=',')

print("Step 2: Data loaded successfully.  \nHere's the dataframe")
display(df)



## 📥 Input and Output Readers in Pandas

###### Pandas allows for easy **data export** to various formats, including **CSV**, **Excel**, and **SQL**.  
###### When working with **Microsoft Fabric Lakehouse**, you need to use the correct file path format.
###### 
### ✅ Example: Saving a DataFrame to a CSV in the Default Lakehouse

###### 📌 **Key Notes:**
###### - The **default Lakehouse path** is `/lakehouse/default/`.
###### - The **filename and folder structure** must be specified correctly.
###### - Setting `index=False` ensures that the DataFrame index is **not saved** in the CSV.
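
The effect of `index=False` matters most when the index holds real information, such as dates. In this sketch a temporary folder stands in for the Lakehouse path (an assumption so that the code runs anywhere), and the DataFrame `demo` is invented for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

dates = pd.date_range(start="2025-03-01", periods=3, freq="D")
demo = pd.DataFrame({"Temperature (C)": [20, 22, 24]}, index=dates)

# A temporary folder stands in for "/lakehouse/default/Files" in this sketch
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "demo.csv"

    # index=False: the dates are not written to the file
    demo.to_csv(path, index=False)
    without_index = pd.read_csv(path)

    # Default (index=True): the dates are written as the first column
    demo.to_csv(path)
    with_index = pd.read_csv(path, index_col=0, parse_dates=True)

print(without_index.columns.tolist())
print(type(with_index.index))
```

If the dates need to survive the round trip, keeping the index (or moving it into a column with `reset_index()` first) is one way to do it.
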

### 📂 Other File Formats Supported by Pandas
| **Format**  | **Save Method** | **Read Method** |
|------------|----------------|----------------|
| CSV        | `df.to_csv('file.csv')` | `pd.read_csv('file.csv')` |
| Excel      | `df.to_excel('file.xlsx')` | `pd.read_excel('file.xlsx')` |
| JSON       | `df.to_json('file.json')` | `pd.read_json('file.json')` |
| Parquet    | `df.to_parquet('file.parquet')` | `pd.read_parquet('file.parquet')` |
| SQL        | `df.to_sql('table', conn)` | `pd.read_sql('SELECT * FROM table', conn)` |

###### 🚀 **Pandas makes input and output operations seamless across multiple formats!**
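
The `conn` in the SQL row of the table is a database connection. As a minimal sketch, an in-memory SQLite database from Python's standard library can play that role; the table name `weather` and the data are made up for illustration, and a real project would connect to its own database instead.

```python
import sqlite3

import pandas as pd

# Hypothetical data, used only for illustration
demo = pd.DataFrame({"city": ["A", "B"], "temp_c": [20, 22]})

# An in-memory SQLite database stands in for a real database server
conn = sqlite3.connect(":memory:")

# Write the DataFrame to a table, then read it back with a query
demo.to_sql("weather", conn, index=False)
loaded = pd.read_sql("SELECT * FROM weather", conn)
conn.close()

print(loaded)
```
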


## 🔹 Saving Data to CSV

In [None]:
# Save to CSV. Note: index=False also drops the date index; leave it out if the dates should be kept
dfWeather.to_csv('/lakehouse/default/' + 'Files/dfWeatherVancouver.csv', index=False)

## 🔹 Importing Data from CSV

In [None]:
import pandas as pd

# Define base path as a variable
base_path = "/lakehouse/default/Files"

# Combine with filename
file_name = "HR_file.csv"
file_path = f"{base_path}/{file_name}"

# Load data
df = pd.read_csv(file_path, delimiter=',')

# Display
display(df)