# Introduction to Python Programming
Prepared by: Ryan Urbanowicz - Cedars Sinai Medical Center
Acknowledgements: Some content generated using ChatGPT, other content based on a tutorial by Dr. Joshua Levy.

Welcome to Python!
* This Jupyter notebook will guide you through some fundamental concepts of Python within Jupyter Lab.
* The left side of Jupyter Lab shows the file browser (e.g. for loading a specific Jupyter Notebook).
---
## Using Jupyter Notebook: Basic Guide
A Jupyter notebook consists of cells. 

**Selecting a Cell:** Click to the left of the cell or, while editing a cell, press **Esc** to stop editing and select the whole cell.

**Cell Types:** The two main types of cells you will use are **code cells** and **markdown cells**.
* **Code Cells:** Contain code you want to run.
    * Turn a cell into a **code cell** by first selecting it, then either (1) choosing 'Code' from the dropdown menu at the top of the notebook, or (2) hitting 'y'.
* **Markdown Cells:** Contain text. The text is written in markdown, a lightweight markup language. Details about writing text in markdown can be found [HERE](https://www.edlitera.com/blog/posts/jupyter-markdown-tutorial).

**Adding a Cell (Below selected):** 
* Click the '+' icon in notebook menu (top of notebook), or
* Hit 'b', or
* Click the 'bar/+' icon at far right of selected cell

**Adding a Cell (Above selected):** 
* Hit 'a', or
* Click the '+/bar' icon at far right of selected cell

**Move Selected Cell (up or down):** 
* Click the up or down arrow icon at far right of selected cell
  
**Deleting a Selected Cell:** 
* Click the 'trash can' icon at far right of selected cell, or
* Hit 'd' twice quickly

**Run a Selected Cell:** 
* Click the 'play' icon in notebook menu (top of notebook), or
* Hit 'Ctrl-Enter'
  
**Restart Kernel & Run All Cells in a Notebook:** 
* Click the double 'play' icon in notebook menu (top of notebook)

**Note:** A "kernel" is a computational engine that runs code from a specific programming language.


***
## Hello World - Rite of Passage
* Python can print text as (standard) output to the screen using the **print()** function
* Try running the code cell (below), then try filling in your own code cell and running it

In [None]:
print("Hello, world!")

#### *Try it Yourself*:
*Type a print command into code cell below and run it.*

In [None]:
# Add code to this cell

***
## Fundamentals
***
### Variables and Data Types
Variables store data in Python. Python supports different data types: integers, floats, strings, and booleans.

**Note:** Comments can be added to Python code using '#'. Comments are NOT run by the Python interpreter.

In [None]:
# Creates variables with assigned values of different data types
a = 10  # Integer
b = 3.14  # Float
c = "Hello, Python!"  # String
d = True  # Boolean

# Print out the value and data type for each variable
print(a, type(a))
print(b, type(b))
print(c, type(c))
print(d, type(d))

#### *Try it Yourself*
*Create a variable (with any name) and assign a value to it (of any data type), then print its value and type.*


In [None]:
# Add code to this cell

### Casting Data Types - i.e. Changing Data Types

In [None]:
# Creates variables with assigned values of different data types
e = 30  # Integer
f = 7.56  # Float
g = 0  # Integer
h = False  # Boolean

# Print out the value and data type for each variable
print('Original Data Types of Variables')
print(e, type(e))
print(f, type(f))
print(g, type(g))
print(h, type(h))

#Casting variables to a different type
e = float(e) #cast to float
f = int(f) #cast to integer
g = bool(g) #cast as Boolean
h = str(h) #cast as string

print('New Data Types of Variables')
print(e, type(e))
print(f, type(f))
print(g, type(g))
print(h, type(h))
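
Casting can change a value, not just its type. The short sketch below illustrates a few cases worth knowing; the variable names `price` and `count` are made up for this example and are not used elsewhere in the notebook.

```python
# int() drops the decimal part (truncates toward zero); it does not round
print(int(7.56))    # 7
print(int(-7.56))   # -7
print(round(7.56))  # 8 (use round() if rounding is what you want)

# bool() gives False for 0, 0.0, and empty strings, and True for other values
print(bool(0), bool(0.0), bool(""))  # False False False
print(bool(-3), bool("False"))       # True True (any non-empty string is True)

# Strings that look like numbers can be cast to numbers
price = float("3.50")
count = int("42")
print(price + count)  # 45.5
```

Note that `bool("False")` is `True`: the string is not empty, and its contents are not interpreted.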

***
### Basic Arithmetic Operations

In [None]:
my_add = a + b # addition
my_sub = a - b # subtraction
my_product = a * b # multiplication
my_division = a / b # division
my_floor_division = a // b # floor division - returns the largest integer less than or equal to the result of the division
my_modulus = a % 3 # modulus - gives remainder
my_exp = a ** 2 # exponentiation
my_combo_math = ((a+b)*a)-b # example of a simple mathematical equation

print("Addition:", my_add)
print("Subtraction:", my_sub)
print("Multiplication:", my_product)
print("Division:", my_division)
print("Floor Division:", my_floor_division)
print("Remainder:", my_modulus)
print("Exponentiation:", my_exp)
print("Combo Math:", my_combo_math)
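
Floor division and modulus behave in a way that can be surprising with negative numbers, because floor division always rounds down (toward negative infinity). The following sketch uses literal numbers rather than the variables above, and the names `quotient` and `remainder` are chosen just for this illustration.

```python
# Floor division rounds DOWN, so negative results move away from zero
print(7 // 2)    # 3
print(-7 // 2)   # -4

# The remainder takes the sign of the divisor (the number on the right)
print(7 % 2)     # 1
print(-7 % 2)    # 1

# divmod() returns the floor-division result and the remainder together
quotient, remainder = divmod(17, 5)
print(quotient, remainder)           # 3 2
print(quotient * 5 + remainder)      # 17 (the pieces recombine to the original)
```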

#### *Try it Yourself*
*Write a mathematical expression of numerical variables 'a' and 'b' saved to a new variable, and print it.*

In [None]:
# Add code to this cell

***
### Relational Operators

In [None]:
is_less = a < b # less than
is_more = a > b # greater than
is_less_eq = a <= b # less than or equal to
is_more_eq = a >= b # greater than or equal to
is_equal = a == b # equal to
is_not_equal = a != b # not equal to
print(a,b)
print(is_less, is_more, is_less_eq, is_more_eq, is_equal, is_not_equal)

#### *Try it Yourself*
*Create a variable that is assigned 'True' or 'False' based on a relational operator comparison, and print it.*

In [None]:
# Add code to this cell

***
### Control Flow: If-Else Statements
* **Note:** In Python, indentation is not just for readability—it’s an essential part of the syntax. Python uses indentation (spaces or tabs) to define the structure of code blocks, such as loops, functions, conditionals, and classes. This is known as Python being "tab-sensitive" or "indentation-sensitive".
* **Note:** In Python, the colon (:) plays an essential role in indicating the start of a code block after certain control structures, like if, else, elif, for, while, and function or class definitions.
* Also demonstrates how to combine text and variables into a print statement.
* The function **str()** returns a 'string' (i.e. text) version of a value, leaving the original variable unchanged.

In [None]:
x = 15
y = 10
if x > y: # Note the use of the colon (:) and the tab indentation prior to the print function
    print(str(x)+ " is greater than " +str(y))
else: 
    print(str(x)+ " is " +str(y)+ " or less")

#### *Try it Yourself*
*Create an if-else statement that uses relational operators to compare variable values, then print something unique for whether the comparison is either True or False. Try running the code with different input values for your variables.*

In [None]:
# Add code to this cell

***
### If-Elif-Else Statements
We can create a set of multiple conditions using if, elif, and else.

In [None]:
x = 5
y = 10
if x > y:
    print(str(x)+ " is greater than " +str(y))
elif x == 5: # a secondary condition that is checked if the first was not True
    print(str(x)+ " is equal to 5")
else: # serves as a catch all if none of the above conditions were satisfied
    print(str(x) + " is neither greater than " +str(y)+ " nor equal to 5")

***
### While Loops
Repeatedly apply a loop until the **while** condition is False.

In [None]:
iteration = 0
while iteration < 5:
    print("While loop iteration:", iteration)
    iteration += 1 # increases variable 'iteration' by 1 each step through the while loop

#### *Try it Yourself*
*Create a while loop statement that uses relational operators as a **while** condition, printing something each loop iteration.*

**Note:** *Make sure at some point the condition will become **False** or the loop will run **infinitely**.  If this does happen, click the **stop** icon at the top of the notebook to interrupt the kernel (i.e. stop the code from running manually).*

In [None]:
# Add code to this cell

***
### Logical Operators
#### 1. AND - More than one condition must be True

In [None]:
iteration = 0
x = 3
while iteration < 10 and x < 150:
    x = x + x
    iteration += 1 
    print("x: "+str(x))
    print("Iteration: "+str(iteration))

#### 2. OR - Either Condition can be True

In [None]:
iteration = 0
x = 3
while iteration < 10 or x < 150:
    x = x + x
    iteration += 1 
    print("x: "+str(x))
    print("Iteration: "+str(iteration))
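
To make the difference between the AND and OR loops above easier to see, the sketch below repeats both loops but prints only a one-line summary at the end of each. The variable names (`and_x`, `or_x`, etc.) are chosen for this illustration so that the two loops do not share any state.

```python
# AND: the loop stops as soon as EITHER condition becomes False
and_iteration = 0
and_x = 3
while and_iteration < 10 and and_x < 150:
    and_x = and_x + and_x
    and_iteration += 1
print("AND loop ran", and_iteration, "times; final x =", and_x)  # 6 times; x = 192

# OR: the loop keeps going until BOTH conditions are False
or_iteration = 0
or_x = 3
while or_iteration < 10 or or_x < 150:
    or_x = or_x + or_x
    or_iteration += 1
print("OR loop ran", or_iteration, "times; final x =", or_x)  # 10 times; x = 3072
```

With AND, the loop ends once x passes 150 (after 6 doublings). With OR, it continues until the iteration count also reaches 10.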

#### 3. NOT
Logical operator in Python that negates a boolean value. It returns True if the given expression evaluates to False and vice versa.

In [None]:
x = 15
y = 10
if not x > y:
    print(str(x)+ " is less than or equal to " +str(y))
else:
    print(str(x)+ " is greater than " +str(y))

#### 4. IS 
The is operator is used to compare object identities, meaning it checks whether two variables reference the same object in memory, not just if they have the same value.
**Note:** None is a special singleton object that represents the absence of a value or "nothing."

In [None]:
x = None
if x is None:
    print("x is None")  # Best practice: use 'is' for None comparisons
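
The difference between **is** (same object) and **==** (same value) is easiest to see with lists, which are introduced in a later section. This is a minimal sketch; the names `list_a`, `list_b`, and `list_c` are hypothetical.

```python
list_a = [1, 2, 3]
list_b = [1, 2, 3]   # a separate list that happens to hold the same values
list_c = list_a      # a second name for the SAME list object as list_a

print(list_a == list_b)  # True  (same values)
print(list_a is list_b)  # False (two different objects in memory)
print(list_a is list_c)  # True  (one object with two names)

# Because list_a and list_c are the same object, a change via one name shows up in the other
list_c.append(4)
print(list_a)  # [1, 2, 3, 4]
print(list_b)  # [1, 2, 3]
```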

#### *Try it Yourself*
*Create an if statement or while loop that uses one or more logical operators. Be creative!*

In [None]:
# Add code to this cell

***
### Lists
Lists store multiple values, which can be of one or more data types.

In [None]:
my_list = [1, True, 3, 'Denver', 5]
print("List contents:", my_list)
print("First element:", my_list[0]) #print the first element, i.e. index 0 (python uses zero-indexing)
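
Beyond reading a single element by index, a few other list operations come up constantly. The sketch below uses a made-up list called `cities` to show length, negative indexing, slicing, and replacing an element.

```python
cities = ["Denver", "Boston", "Austin", "Seattle"]

print(len(cities))   # 4 (number of elements)
print(cities[-1])    # Seattle (negative indexes count from the end)
print(cities[1:3])   # ['Boston', 'Austin'] (slice: index 1 up to, but not including, 3)

cities[0] = "Chicago"  # lists can be changed in place
print(cities)        # ['Chicago', 'Boston', 'Austin', 'Seattle']
```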

#### *Try it Yourself*
*Create a list of 5 elements and print the 3rd element (Hint: the third element would be index '2')*

In [None]:
# Add code to this cell

***
### For Loops
#### 1. Looping through a list

In [None]:
fruits = ["apple", "banana", "cherry"]
for fruit in fruits: # for each element in the list 'fruits'
    print(fruit) # print the individual element 

#### 2. Looping through a range of numbers (starting at 0)

In [None]:
for i in range(5):  # Loops from 0 to 4
    print(i)

#### 3. Looping through a range of numbers (specified range)

In [None]:
for i in range(25,30):  # Loops from 25 to 29
    print(i)

#### 4. Looping through a string

In [None]:
word = "Python"
for letter in word:
    print(letter)

#### 5. Using range(start, stop, step)

In [None]:
for num in range(1, 10, 2):  # Start at 1, stop before 10, step by 2
    print(num)

#### *Try it Yourself*
*Create a for loop that prints individual elements of a list, range of numbers, or a string.*

In [None]:
# Add code to this cell

***
### List Comprehensions
A concise way to create lists in Python using a single line of code. It allows you to build a new list by applying an expression to each item in an existing iterable (like a list, range, or string) with an optional condition.

In [None]:
# Without list comprehension
squares = []
for i in range(5):
    squares.append(i**2)
print(squares)  # Output: [0, 1, 4, 9, 16]

# With list comprehension
squares = [i**2 for i in range(5)]
print(squares)  # Output: [0, 1, 4, 9, 16]
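
The description above mentions an optional condition, which the squares example does not use. As a small sketch (the names `even_squares` and `shouted` are made up for this illustration), a condition is added with `if` at the end of the comprehension:

```python
# Keep only the even values of i, then square them
even_squares = [i**2 for i in range(10) if i % 2 == 0]
print(even_squares)  # [0, 4, 16, 36, 64]

# A comprehension can loop over any iterable, such as a string
shouted = [letter.upper() for letter in "python"]
print(shouted)  # ['P', 'Y', 'T', 'H', 'O', 'N']
```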

#### *Try it Yourself*
*Create a list, using list comprehension, that includes all odd numbers between 0 and 10. Then print it.*

In [None]:
# Add code to this cell

***
### Dictionaries
Dictionaries store key-value pairs
#### 1. Create a dictionary

In [None]:
# Dictionaries store key-value pairs
student = {
    "name": "John Doe",
    "age": 21,
    "major": "Computer Science",
    "GPA": 3.8
}
print(student) # print entire dictionary
print(student['name']) # print the value for the key, 'name'
print(student['GPA']) # print the value for the key, 'GPA'

#### 2. Add a new Key-value Pair

In [None]:
student["graduation_year"] = 2025
print(student)

#### 3. Looping through a Dictionary
Includes an example of f-string formatting, which allows inserting variables directly into a string.

In [None]:
for key, value in student.items():
    print(f"{key}: {value}") # f-string formatting 
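
Asking a dictionary for a key it does not contain (e.g. `student['email']`) raises a `KeyError`. The sketch below shows two common ways to handle that, using a separate made-up dictionary called `pet` so the `student` dictionary is left untouched.

```python
pet = {"name": "Biscuit", "species": "dog"}

# Check whether a key exists before using it
print("age" in pet)               # False

# .get() returns a fallback value instead of raising an error
print(pet.get("age", "unknown"))  # unknown

# Assigning to a key adds it if it is new, or overwrites it if it already exists
pet["age"] = 4
pet["age"] = 5
print(pet.get("age", "unknown"))  # 5
print(len(pet))                   # 3 (number of key-value pairs)
```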

#### *Try it Yourself*
*Create a dictionary and print out the value for a specific key.*

In [None]:
# Add code to this cell

***
## Functions
In Python, a function is a reusable block of code that performs a specific task. Functions allow you to organize your code, make it more modular, and avoid repetition.

### Defining a Function
Below we create an example function that calculates the slope of a line given two points. 
* **def** indicates that we are defining a function.
* **Arguments**, i.e. input values, for the function are specified in parentheses, separated by commas
* This first line defining the function must end with a colon (:)
* **return** is often used at the end of a function to return one or more values to where the function is called. 

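**Note:** We can create multi-line blocks of descriptive text in Python by surrounding them with triple quotes, i.e. """ """ (see below). When such a block is the first statement in a function, it is called a docstring and serves as the function's documentation.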


In [None]:
def calculate_slope(x1, y1, x2, y2):
    """
    Calculates the slope of a line given two points (x1, y1) and (x2, y2).
    
    :param x1: x-coordinate of the first point.
    :param y1: y-coordinate of the first point.
    :param x2: x-coordinate of the second point.
    :param y2: y-coordinate of the second point.
    :return: The slope of the line.
    """
    # Check if the line is vertical (x1 == x2), which would cause division by zero
    if x1 == x2:
        raise ValueError("The line is vertical, slope is undefined.")
    
    # Calculate the slope using the formula (y2 - y1) / (x2 - x1)
    slope = (y2 - y1) / (x2 - x1)
    return slope  # returns the variable's value to wherever the function was called

### Using a Defined Function

In [None]:
my_x1 = 5
my_y1 = 10
my_x2 = 7
my_y2 = -14
slope = calculate_slope(my_x1, my_y1, my_x2, my_y2) # the output of a function can be saved to a new variable
print(slope)
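
The function above uses `raise ValueError(...)` for vertical lines. One way to handle that error without stopping the notebook is a `try`/`except` block, sketched below. A compact copy of `calculate_slope` is repeated so the example runs on its own, and `vertical_slope` is a name made up for this illustration.

```python
def calculate_slope(x1, y1, x2, y2):
    """Compact copy of the function defined above, repeated so this cell runs on its own."""
    if x1 == x2:
        raise ValueError("The line is vertical, slope is undefined.")
    return (y2 - y1) / (x2 - x1)

try:
    # Both points have x = 3, so the function raises a ValueError
    vertical_slope = calculate_slope(3, 1, 3, 8)
except ValueError as error:
    vertical_slope = None  # fall back to None when no slope exists
    print("Could not compute slope:", error)

print(vertical_slope)                   # None
print(calculate_slope(5, 10, 7, -14))   # -12.0
```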

#### *Try it Yourself - Putting Coding Pieces Together*
* Create and call a function that takes **weight** and **height** as arguments and calculates Body Mass Index (BMI), returning both the **BMI value** and **BMI category**.
* BMI categories can be defined as:
    * BMI < 18.5 is Underweight
    * 18.5 <= BMI < 24.9 is Normal Weight
    * 24.9 <= BMI < 29.9 is Overweight
    * BMI >= 29.9 is Obese

In [None]:
# Add code to this cell

### Appending Values to a List & Creating a More Complicated Function
* The **append()** method adds a new value to the end of a list, increasing its length.

In [None]:
def process_numbers(numbers, threshold):
    """
    Processes a list of numbers. Doubles the even numbers, triples the odd numbers,
    and checks if the sum of the numbers is greater than a given threshold.
    
    :param numbers: A list of numbers to process.
    :param threshold: A number to compare the sum against.
    :return: A tuple containing the processed list and a boolean indicating 
             if the sum exceeds the threshold.
    """
    processed_numbers = []  # Initializes an empty list to store the processed numbers
    total_sum = 0  # Variable to accumulate the sum of the numbers

    # Loop through the numbers in the list
    for num in numbers:
        if num % 2 == 0:  # Check if the number is even
            processed_number = num * 2  # Double the number
        else:
            processed_number = num * 3  # Triple the number
        
        processed_numbers.append(processed_number)  # Add the processed number to the list
        total_sum += processed_number  # Add the processed number to the total sum
    
    # Check if the total sum is greater than the threshold
    is_greater_than_threshold = total_sum > threshold
    
    return processed_numbers, is_greater_than_threshold # Returns two separate variables

In [None]:
numbers = [1, 2, 3, 4, 5]
threshold = 30

# Call the function
processed_numbers, exceeds_threshold = process_numbers(numbers, threshold)

print("Processed numbers:", processed_numbers)
print("Does the sum exceed the threshold?", exceeds_threshold)

***
## Importing and Using Libraries & Modules
A Python library is a collection of modules (files containing Python code) that provide pre-written functions and tools to perform specific tasks, saving time and effort when writing code.
* Libraries are most commonly imported at the beginning of a Jupyter Notebook (or Python script), by convention, but they can be imported as needed (as done here). 

Libraries can include functions for:
* Mathematics & Statistics (e.g., math, numpy)
* Data Manipulation (e.g., pandas)
* Machine Learning & AI (e.g., scikit-learn, tensorflow)
* Web Development (e.g., flask, django)
* Visualization (e.g., matplotlib, seaborn)

Many libraries have already been installed on your computer as part of the Anaconda installation; however, many more exist that users may wish to install using **conda** or **pip**.

Below we give a few examples of importing libraries (that have already been installed with the Anaconda installation). 

***
### Library Example: Math

In [None]:
import math # Imports are normally put in the first code cell of the notebook by convention

print(math.sqrt(25))       # Output: 5.0 (Square root of 25)
print(math.pi)             # Output: 3.141592653589793 (Value of Pi)
print(math.sin(math.radians(30)))  # Output: 0.49999999999999994 (Sine of 30 degrees; approximately 0.5 due to floating-point rounding)

***
### Library Example: Random

In [None]:
import random # Imports are normally put in the first code cell of the notebook by convention

print(random.randint(1, 10))     # Random integer between 1 and 10
print(random.choice(["apple", "banana", "cherry"]))  # Random choice from a list
print(random.uniform(1.5, 5.5))  # Random float between 1.5 and 5.5

***
### Library Example: OS - Interacting with the operating system

In [None]:
import os # Imports are normally put in the first code cell of the notebook by convention

print(os.getcwd())   # Output: Current working directory
print(os.listdir())  # Output: List of files in the current directory

***
### Library Example: Numpy - Working with Arrays
* Somewhat different from **lists**, **arrays**:
    * Store only one data type (e.g., all int or all float)
    * Are faster for numerical computations
    * Use less memory due to fixed data type
    * Support element-wise operations directly (no need for loops)
    * Are best for numerical and scientific computing

In [None]:
import numpy as np # Libraries can be imported and assigned a different name for calling

arr = np.array([1, 2, 3, 4])
print(arr * 2)  # Output: [2 4 6 8] (Element-wise multiplication)
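
For contrast, here is what the same `* 2` operation does to a plain Python list. This sketch uses only built-in Python (no numpy), and the names `values`, `repeated`, and `doubled` are made up for this illustration.

```python
values = [1, 2, 3, 4]

# Multiplying a LIST by 2 repeats the list; it does not double each element
repeated = values * 2
print(repeated)  # [1, 2, 3, 4, 1, 2, 3, 4]

# To double each element of a list, we need a loop or a list comprehension
doubled = [value * 2 for value in values]
print(doubled)   # [2, 4, 6, 8]
```

This is what "element-wise operations directly" means for arrays: the numpy version needs no explicit loop.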

***
### Library Example: Pandas - Working with DataFrames
A **DataFrame** is a tabular data structure provided by the pandas library. It is similar to a table in a database, an Excel spreadsheet, or an R DataFrame.

In [None]:
import pandas as pd # Imports are normally put in the first code cell of the notebook by convention

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)

print(df)

***
### Library Example: Matplotlib - Plotting Graphs
The matplotlib.pyplot module helps create visualizations.

In [None]:
import matplotlib.pyplot as plt # Imports are normally put in the first code cell of the notebook by convention

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y, marker='o')  # Plot a line graph with points
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Simple Line Plot")
plt.show()

***
## Using ChatGPT to Get Python Coding Examples
ChatGPT is an example of a large language model (LLM) that can be very useful for learning more about writing Python code and directly generating example code for completing specific tasks. 

Below we give some examples of requests that might be provided to ChatGPT to get it to generate some useful python code. 

***
### Example: Get the area under a line plot.
**ChatGPT Prompt:** Show me how to write python code that calculates the area under a line plot.

***
### Example: Get the p-value for a T-test comparison.
**ChatGPT Prompt:** Show me how to write python code that calculates the p-value between two sets of values using a t-test.

***
### Example: Generating a clustered heatmap of gene expression data with two treatment groups.
**ChatGPT Prompt:** Show me example python code on how to generate a hierarchically clustered heatmap of gene expression data comparing treated and untreated groups.

***
### Example: Get the average of values in a specific dataframe column.
**ChatGPT Prompt:** Show me how to write python code that gets the average of a specific column within a dataframe

***
***
# Using Python for Data Science (Workshop 3)
The sections below are primarily meant to pair with the "Essentials of Machine Learning" Workshop, but you can skip ahead and try running these cells.

In this section we provide a brief demonstration of just some of the many steps that might be included in a machine learning/data science analysis pipeline. 
***
### Loading a Dataset
As a demonstration we use a small public hepatocellular carcinoma (HCC) dataset taken from the UCI repository. 
* This dataset has a binary class 'outcome' (1 = patient survived, and 0 = patient died)
* The data file we will use is in the same folder as this notebook, and is called **hcc_data.csv**
* **Note:** Typically, it is important to identify whether variables in the dataset are **categorical** or **numeric**, as this can impact how we process and analyze the dataset.  Binary variables can reliably be treated as either. For the purposes of this demonstration we will assume all variables are **numeric**.

In [None]:
# Load the CSV file into a DataFrame
df = pd.read_csv("hcc_data.csv")  # Since the file is in the same directory as this notebook we can just load it by name (without a file path)

# Get the number of rows and columns in the dataset
num_rows, num_columns = df.shape
print(f"The dataset contains {num_rows} rows and {num_columns} columns.")

In [None]:
# Display the first 5 rows of the dataset
print(df.head())

***
### Basic Exploratory Data Analysis
#### 1. Basic Dataset Information For all Columns (i.e. Variables)

In [None]:
# Get basic information about the dataset
print(df.info())

#### 2. Missing Value Counts

In [None]:
# Check for missing values
print("Missing Values in Each Column:")
print(df.isnull().sum())

# Calculate the total number of missing values
total_missing = df.isnull().sum().sum()
print("Total Number of Missing Values in Dataset: "+str(total_missing))

#### 3. Summary Statistics for Numerical Columns

In [None]:
# Generate basic statistics for numerical columns
print("Summary Statistics for Numerical Columns:")
print(df.describe())

#### 4. Bar Plot of Class Counts

In [None]:
import seaborn as sns # Imports are normally put in the first code cell of the notebook by convention

# Define the column containing class labels (replace 'ClassColumn' with your actual column name)
class_column = "Class"  # Change this to match your dataset

# Count the occurrences of each class
class_counts = df[class_column].value_counts()

# Plot the bar chart
plt.figure(figsize=(8, 6))
sns.barplot(x=class_counts.index, y=class_counts.values)

# Add labels and title
plt.xlabel("Class Labels", fontsize=12)
plt.ylabel("Count", fontsize=12)
plt.title("Class Distribution", fontsize=14)
plt.xticks(rotation=45)  # Rotate labels if needed

# Show the plot
plt.show()

***
### Basic Data Cleaning (Of Missing Values)
#### 1. Checking Whether Removing Rows with Missing Values is Viable

In [None]:
# Print number of rows (i.e. instances/samples) in the original dataset
print("Number of rows in original dataset: "+str(df.shape[0]))

# Remove rows with any missing values
df_cleaned = df.dropna()

# Count the number of rows that have no missing values
num_rows_no_missing = df_cleaned.shape[0]
print("Number of rows with no missing values: "+str(num_rows_no_missing))

Here we can see that simply removing any rows with missing values leaves us with very few rows to do anything with. So instead, for the purposes of this demonstration, we will **impute** these missing values (i.e. make an educated guess at them), replacing them with substitute values. 

#### 2. Imputing Missing Values
* Given that we are assuming all variables are **numeric** we are opting here to use simple **median imputation**
* For categorical variables, a simple imputation strategy could include **mode imputation**

In [None]:
# Apply median imputation to create a new dataframe
df_imputed = df.fillna(df.median())

# Check the total number of missing values in the new dataframe
new_total_missing = df_imputed.isnull().sum().sum()
print("Total Number of Missing Values in Dataset: "+str(new_total_missing))
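
To show what median and mode imputation actually do to the values, here is a minimal sketch in plain Python using the standard library's `statistics` module rather than pandas. The data (`ages`, `sexes`) are invented for this illustration and are not taken from the HCC dataset.

```python
from statistics import median, mode

# Hypothetical columns; None marks a missing value
ages = [34, None, 51, 47, None, 62]        # numeric variable
sexes = ["F", "M", None, "F", "F", None]   # categorical variable

# Median imputation: compute the median of the observed values only
observed_ages = [age for age in ages if age is not None]
age_median = median(observed_ages)
imputed_ages = [age if age is not None else age_median for age in ages]
print(age_median)    # 49.0
print(imputed_ages)  # [34, 49.0, 51, 47, 49.0, 62]

# Mode imputation: fill in the most common observed category
observed_sexes = [sex for sex in sexes if sex is not None]
sex_mode = mode(observed_sexes)
imputed_sexes = [sex if sex is not None else sex_mode for sex in sexes]
print(sex_mode)       # F
print(imputed_sexes)  # ['F', 'M', 'F', 'F', 'F', 'F']
```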

***
### Unsupervised Learning - Principal Component Analysis & K-Means Clustering
Here we also see yet another way to import useful code, specifically **StandardScaler**, **PCA**, and **KMeans**. 

e.g. **from sklearn.cluster:** This part tells Python to import from the cluster submodule of the sklearn (scikit-learn) library. The cluster submodule contains algorithms related to clustering (grouping data based on similarities).

In [None]:
from sklearn.preprocessing import StandardScaler # Imports are normally put in the first code cell of the notebook by convention
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Create a new DataFrame without the 'Class' column
df_x = df_imputed.drop(columns=[class_column])

# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_x)

# Apply PCA for dimensionality reduction (optional)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Apply KMeans clustering
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Visualize the clusters
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis', marker='o', alpha=0.5)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('KMeans Clustering')
plt.colorbar(label='Cluster')
plt.show()

***
### Train and Evaluate Machine Learning Models (with 3-fold Cross Validation)
Below is a function to train and evaluate machine learning models when passed the following arguments:
* **local_df:** A target dataframe (with feature columns and an outcome column)
* **n_splits:** The number of cross validation (training and testing) data partitions
* **model:** A specific machine learning algorithm, e.g. decision tree, random forest, etc.

Once called, this function does the following:
* Separate the potentially predictive variables (i.e. features, 'X'), from the outcome (i.e. class, 'y')
* Initialize lists to store evaluation metrics for each CV partition
* Creation of training and testing data splits for each CV partition
* Fitting (i.e. training) the model on training data
* Applying the trained model to make predictions on testing data
* Calculating evaluation metrics for the model's predictive performance
* Ultimately averaging evaluation metrics across all CV partitions and reporting performance
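
The idea behind the cross validation partitions can be sketched in plain Python. The helper below, `k_fold_indices`, is a hypothetical function written for this illustration; it is not scikit-learn's implementation and, unlike the function defined later in this section, it does not shuffle the rows first.

```python
def k_fold_indices(n_samples, n_splits):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any leftover rows over the first folds so fold sizes differ by at most 1
    base_size, remainder = divmod(n_samples, n_splits)
    folds = []
    start = 0
    for fold_number in range(n_splits):
        fold_size = base_size + (1 if fold_number < remainder else 0)
        test_indices = indices[start:start + fold_size]
        # Every row that is not in the test fold is used for training
        train_indices = indices[:start] + indices[start + fold_size:]
        folds.append((train_indices, test_indices))
        start += fold_size
    return folds

folds = k_fold_indices(n_samples=7, n_splits=3)
for train_indices, test_indices in folds:
    print("train:", train_indices, "test:", test_indices)
```

Each row lands in the testing set exactly once and in the training set for every other partition, which is why the metrics are averaged across partitions at the end.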

Here we are using another important library, **sklearn** (otherwise known as scikit-learn) that includes many useful machine learning algorithms and related methods.
#### 1. Define Function 

In [None]:
from sklearn.model_selection import KFold # Imports are normally put in the first code cell of the notebook by convention
from sklearn.metrics import accuracy_score, balanced_accuracy_score, precision_score, recall_score, f1_score

def train_model(local_df, n_splits, model):
    # Split data into features and target
    X = local_df.drop(columns=['Class'])
    y = local_df['Class']

    # Initialize evaluation metrics lists
    accuracies = []
    balanced_accuracies = []
    precisions = []
    recalls = []
    f1_scores = []

    # Initialize KFold cross-validation
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    
    # Iterate over each fold of cross-validation
    for train_index, test_index in kf.split(X):
        # Split data into train and test sets for this fold
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        model.fit(X_train, y_train)

        # Predict on the testing data
        y_pred = model.predict(X_test)

        # Calculate evaluation metrics
        accuracies.append(accuracy_score(y_test, y_pred))
        balanced_accuracies.append(balanced_accuracy_score(y_test, y_pred))
        precisions.append(precision_score(y_test, y_pred, average='weighted'))
        recalls.append(recall_score(y_test, y_pred, average='weighted'))
        f1_scores.append(f1_score(y_test, y_pred, average='weighted'))
        
    # Calculate average evaluation metrics
    avg_accuracy = sum(accuracies) / len(accuracies)
    avg_balanced_accuracy = sum(balanced_accuracies) / len(balanced_accuracies)
    avg_precision = sum(precisions) / len(precisions)
    avg_recall = sum(recalls) / len(recalls)
    avg_f1 = sum(f1_scores) / len(f1_scores)

    # Print summary of model performance
    print("Average Accuracy:", avg_accuracy)
    print("Average Balanced Accuracy:", avg_balanced_accuracy)
    print("Average Precision:", avg_precision)
    print("Average Recall:", avg_recall)
    print("Average F1 Score:", avg_f1)

    return [avg_accuracy, avg_balanced_accuracy, avg_precision, avg_recall, avg_f1]
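
To make the evaluation metrics less of a black box, the sketch below computes them by hand for a tiny, invented set of true and predicted labels. These are the plain binary versions of each metric (treating class 1 as the positive class); the function above asks scikit-learn for `average='weighted'` precision, recall, and F1, which can give different numbers.

```python
# Hypothetical labels: 1 = positive class, 0 = negative class
true_labels      = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(true_labels, predicted_labels))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)          # fraction of all predictions that are correct
precision = tp / (tp + fp)                 # of predicted positives, fraction truly positive
recall = tp / (tp + fn)                    # of true positives, fraction that were found
specificity = tn / (tn + fp)               # of true negatives, fraction that were found
balanced_accuracy = (recall + specificity) / 2
f1 = 2 * precision * recall / (precision + recall)

print("TP, TN, FP, FN:", tp, tn, fp, fn)   # 3 5 1 1
print("Accuracy:", accuracy)               # 0.8
print("Precision:", precision)             # 0.75
print("Recall:", recall)                   # 0.75
print("Balanced accuracy:", round(balanced_accuracy, 3))
print("F1 score:", f1)                     # 0.75
```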

#### 2. Train/Evaluate Decision Tree Models

In [None]:
from sklearn.tree import DecisionTreeClassifier # Imports are normally put in the first code cell of the notebook by convention
# Train a decision tree model
n_splits=3
model = DecisionTreeClassifier() # Here we don't specify any hyperparameter values so algorithm defaults will be used
metric_list_1 = train_model(df_imputed, n_splits, model)

#### 3. Train/Evaluate Random Forest Models

In [None]:
from sklearn.ensemble import RandomForestClassifier # Imports are normally put in the first code cell of the notebook by convention
# Train a random forest model
n_splits=3
model = RandomForestClassifier() # Here we don't specify any hyperparameter values so algorithm defaults will be used
metric_list_2 = train_model(df_imputed, n_splits, model)

#### 4. Basic Average Metric Comparison Plot Between Algorithms

In [None]:
import matplotlib.pyplot as plt # Imports are normally put in the first code cell of the notebook by convention
import numpy as np

# Create a range of positions for the bars
x = np.arange(len(metric_list_1))

# Set the width of the bars
width = 0.35

# Create the bar chart
fig, ax = plt.subplots()

# Plot the bars for both lists
bars1 = ax.bar(x - width/2, metric_list_1, width, label='Decision Tree')
bars2 = ax.bar(x + width/2, metric_list_2, width, label='Random Forest')

# Add some text for labels, title, and custom x-axis tick labels
ax.set_xlabel('Metric')
ax.set_ylabel('Values')
ax.set_title('Comparison of Evaluation Metrics (Decision Tree vs. Random Forest)')
ax.set_xticks(x)
ax.set_xticklabels(['Accuracy', 'Bal. Accuracy', 'Precision', 'Recall', 'F1 Score'])
plt.xticks(rotation=45)

ax.legend(loc='upper left', bbox_to_anchor=(1, 1))

# Show the plot
plt.tight_layout()
plt.show()

***
### Generating an ROC plot to evaluate a machine learning model

In [None]:
from sklearn.linear_model import LogisticRegression # Imports are normally put in the first code cell of the notebook by convention
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split

# Split data into features and target
X = df_imputed.drop(columns=['Class'])
y = df_imputed['Class']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a binary classifier (Logistic Regression in this case)
model = LogisticRegression()
model.fit(X_train, y_train)

# Get the predicted probabilities (not just the class labels)
y_pred_prob = model.predict_proba(X_test)[:, 1]  # Get probabilities for the positive class (class 1)

# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Calculate AUC (Area Under the Curve)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line (random classifier)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)

# Show the plot
plt.show()
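
Each point on an ROC curve is the false positive rate and true positive rate obtained at one probability threshold. The plain-Python sketch below works through that idea on four invented predictions; `roc_point`, `toy_labels`, and `toy_scores` are hypothetical names for this illustration, and this is not how scikit-learn implements `roc_curve` internally.

```python
toy_labels = [0, 0, 1, 1]            # true classes (hypothetical)
toy_scores = [0.1, 0.4, 0.35, 0.8]   # predicted probability of class 1 (hypothetical)

def roc_point(labels, scores, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return fp / (fp + tn), tp / (tp + fn)

# Lowering the threshold step by step traces the curve from (0, 0) to (1, 1)
roc_points = [roc_point(toy_labels, toy_scores, t) for t in [0.9, 0.8, 0.4, 0.35, 0.1]]
print(roc_points)

# Area under the curve using the trapezoid rule between consecutive points
toy_auc = 0.0
for (x0, y0), (x1, y1) in zip(roc_points, roc_points[1:]):
    toy_auc += (x1 - x0) * (y0 + y1) / 2
print("AUC:", toy_auc)  # 0.75
```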

***
## Using ChatGPT to Help Write Python Code for Data Science
ChatGPT is an example of a large language model (LLM) that can be very useful for learning more about writing Python code and directly generating example code for completing specific tasks. 

Below we give some examples of requests that might be provided to ChatGPT to get it to generate some useful python code. 

***
### Example: Applying Iterative Imputation for Missing Data Values
**ChatGPT Prompt:** Write me python code to apply iterative imputation to a dataframe to replace missing values.

***
### Example: Stratified Cross Validation
**ChatGPT Prompt:** Write me python code to apply stratified cross validation to a dataset balancing the 'Class' column as closely as possible across partitions. 

***
### Example: Hyperparameter Sweep
**ChatGPT Prompt:** Write me python code that completes a simple hyperparameter sweep for a decision tree algorithm

***
### Example: Boxplot Comparing Algorithm Performance
**ChatGPT Prompt:** Write me python code to generate a boxplot comparing the F1-scores across 10 CV partitions for two algorithms.

***
### Example: Estimating Model Feature Importance
**ChatGPT Prompt:** Write me python code to calculate model feature importance estimates following model training. 