# Welcome

### What is this?
This course _(Advanced AI: Using Large Language Models for Data Analysis)_ is designed to introduce basic programming skills. The focus is on practical applications of coding in healthcare settings, with the aim of getting you comfortable with the basics. We hope you can apply these skills to automate the boring things in your life, maybe use some of them in a research project, and overall feel less intimidated by the idea of programming.

### Where am I?
This is a private "Jupyter notebook" hosted on one of the AIM lab computers. Jupyter notebooks combine a normal document, like one written in Microsoft Word, with usable programming areas. This makes it really easy to follow along with someone else's code or do a tutorial. One of the most commonly used notebook services these days is [Google Colab](https://colab.google.com). Google offers free notebooks to every Google account, where you can run your own scripts without downloading anything! This website, the AIM Lab Cloud, is similar to Google Colab, except that it is private and pre-loaded with some data for today's exercise.

# About Notebooks
### Cells
Notebooks are broken into "cells". Each cell contains either just text (like this one) or some code; the code can be edited and run.

The cell below this is a simple command in Python to print out a phrase with a name. Click on it to edit it, and change the name to your own. Keep the quotation marks ```"``` in place and only change the name. Once you are satisfied, press the "play" button (►) at the top or press shift-enter to run that cell.

In [1]:
print("Hi there, Harry Lemmens!")

Hi there, Harry Lemmens!


Great job! You have successfully run (or _executed_) your first Python script. See, it's not that bad. What if we wanted to separate the name from the rest of the command? To do that, we'd need to use a __variable__.

# Variables
Variables are words that represent some other value. The name of a variable can only contain letters, numbers, and underscores, and it cannot start with a number (by convention, names are lowercase, with spaces swapped for underscores). There are a bunch of different types of variables, but the most important ones are:

- integers (whole numbers)
- floats (numbers with decimal places)
- strings (sets of characters, such as a word or a sentence)
- booleans (true or false, i.e. binary)
- lists (groups of other variables ordered by their position in the list)
- dictionaries (groups of variables ordered by word labels)

To assign a value to a variable, you use the equals sign. The value on the right side is stored in the variable on the left side of the equals sign. 

Note: we can place comments into our code by prefixing something with the pound/hash ```#``` symbol. Comments aren't "run" but are just there to help someone understand the code.

In [2]:
# integer
big_number = 1025

# float
another_big_number = 2310.2315

# strings are wrapped in single quotes or double quotes
a_good_sentence = "Just put the tube in" 

# booleans are either true or false. In python, True and False are capitalized
tube_is_in = True

# lists are made using square brackets, with elements separated by commas. Lists can be made of different variable types.
good_variables = [ big_number, another_big_number, a_good_sentence, tube_is_in ]

# we can print the whole list to see what is stored inside
print(good_variables)

[1025, 2310.2315, 'Just put the tube in', True]


Excellent! Notice that the string still has quotation marks around it - that's how you know it is a string. Notice also that True is capitalized. If it were lowercase, Python would think it was a variable name and go looking for its value.
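If you are ever unsure what kind of value a variable holds, Python's built-in `type()` function will tell you. The short sketch below re-creates the variables from the cell above so that it can run on its own.

```python
# Re-create the variables from the earlier cell so this example stands alone
big_number = 1025
another_big_number = 2310.2315
a_good_sentence = "Just put the tube in"
tube_is_in = True

# type() reports what kind of value each variable holds
print(type(big_number))          # an integer: <class 'int'>
print(type(another_big_number))  # a float: <class 'float'>
print(type(a_good_sentence))     # a string: <class 'str'>
print(type(tube_is_in))          # a boolean: <class 'bool'>
```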

If we wanted to get just part of a list, we would need to access the elements using their position (also known as _index_). The first position is zero, so if we wanted the value of ```tube_is_in``` we'd have to get the element with the index of three. Here is how you do that:


In [3]:
# to access an element within a list, use square brackets after the list name
print(good_variables[3])

True
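Two related tools can help when working with positions in a list: the built-in `len()` function, which reports how many elements a list has, and negative indexes, which count backwards from the end. Here is a minimal sketch, with the list values typed in directly so the example runs on its own:

```python
# The same values as the list above, typed in directly
good_variables = [1025, 2310.2315, "Just put the tube in", True]

# len() gives the number of elements in the list
number_of_elements = len(good_variables)
print(number_of_elements)  # 4

# Index 0 is the first element
first_element = good_variables[0]
print(first_element)  # 1025

# Index -1 is a shortcut for the last element, whatever the list length
last_element = good_variables[-1]
print(last_element)  # True
```

Because counting starts at zero, the largest valid index is always one less than `len()`. Asking for `good_variables[4]` here would raise an `IndexError`.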


Remembering the index within a list is annoying and error-prone, so there is a better way, using _dictionaries_. Dictionaries are like lists except that we can use a string (ie, a set of characters) to reference the element, instead of having to remember its index number. 

In [4]:
# dictionaries are like lists but are made using the curly brackets. Instead of relying on their
# position in the list as their label, they have a key. The main way to assign them is using the colon symbol. 
residency_facts = { "Program Name": "Stanford Anesthesia",
                    "Program Director": "Marianne Chen", 
                    "Number of Residents": 52, 
                    "Enough Cardiac Experience": True}

# to access an element, you use a similar method to lists, with square brackets, but instead of the index number, you use the key
residency_facts["Program Director"]

'Marianne Chen'
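Dictionaries can also be changed after they are created, using the same square-bracket syntax together with the equals sign. The sketch below is only an illustration: the new resident count and the `"City"` entry are made-up values, not facts about the program.

```python
residency_facts = {"Program Name": "Stanford Anesthesia",
                   "Program Director": "Marianne Chen",
                   "Number of Residents": 52,
                   "Enough Cardiac Experience": True}

# Assigning to an existing key replaces its value (54 is a made-up number)
residency_facts["Number of Residents"] = 54

# Assigning to a key that does not exist yet adds a new entry
# ("City" is a hypothetical key added just for this example)
residency_facts["City"] = "Stanford"

# The "in" keyword checks whether a key is present
has_city = "City" in residency_facts
has_salary = "Salary" in residency_facts

print(residency_facts["Number of Residents"])
print(has_city, has_salary)
```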

Wonderful! So now we know how to store data in variables. By itself, that is not very useful. To make a useful program, we generally need to compare values and make decisions based on those values. To do that, we need conditional statements, a way to guide the program on what to do based on the value of a variable. The big one is the if/else statement - compare the values of two variables and do something based on that comparison.

# Conditional statements

Note that if we want to ask whether two variables are equal to each other, we can't use the same phrasing as before, "name = value", because a single equals sign is used for assignment. To compare two values, we use a double equals sign ```==```.

To tell Python which instructions belong to a condition, the lines underneath it are indented with tabs or spaces. The indented lines form a block, like a paragraph of statements, that only runs when the condition is true.


In [5]:
# assigning the variable our_number a value of 1025
our_number = 1025

# comparing the value of a variable to the integer 1025
if our_number == 1025:
    print("That's the same number as before, a big number.")

That's the same number as before, a big number.


You can also include ```else``` statements, and you can compare values not just for equality, but also for which one is larger or smaller.

In [6]:
our_number = 1025

if our_number > 2000:
    print("Wow such numbers")
else:
    # should be less than or equal to 2000
    print("Rookie numbers")

Rookie numbers
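When there are more than two possible outcomes, Python's `elif` keyword (short for "else if") lets you chain several comparisons together. This goes slightly beyond the example above and is only a sketch; the middle category and its message are invented for illustration.

```python
our_number = 1025

# Conditions are checked from top to bottom; the first one that is true wins
if our_number > 2000:
    size_label = "Wow such numbers"
elif our_number >= 1000:
    # only reached when our_number is 2000 or less
    size_label = "Getting there"
else:
    # only reached when our_number is less than 1000
    size_label = "Rookie numbers"

print(size_label)
```

The comparison operators available are `==` (equal), `!=` (not equal), `>`, `<`, `>=`, and `<=`.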


You can compare any type of value, not just numbers.


In [7]:
old_program_director = "Alex Macario"
residency_facts = { "Program Name": "Stanford Anesthesia",
                    "Program Director": "Marianne Chen", 
                    "Number of Residents": 52, 
                    "Enough Cardiac Experience": True}

if residency_facts["Program Director"] == old_program_director:
    print("Same program director as year 2010")
else:
    print("New program director compared to 2010")


New program director compared to 2010


Wonderful. 

# Iteration

You might say: "That is all great guys, but these are really easy and I don't really need a computer to do this kind of work for me. I want a computer for tasks that require a lot of repetitive effort."

This is where iteration can be helpful. Iteration is a way to complete a task many times. For example, imagine we had a list of people's names and we wanted to print their name with the title "Dr" in front of it. Doing this manually would take a long time, but we can do it quickly in python.

The most common way to iterate is to do an action repeatedly for each element in a list. In some programming languages, this is called ```foreach```. In Python, it is just called a ```for``` loop. In a for loop, we give the program a temporary variable that represents the element that we're working on, and it moves through the list until every element has been handled.


In [8]:
# create a list of names
ob_faculty_list = ["Gill Abrir", "Brian Bateman", "Andrea Traynor"]
for ob_faculty in ob_faculty_list:
    print("Dr. " + ob_faculty)
    

Dr. Gill Abrir
Dr. Brian Bateman
Dr. Andrea Traynor
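Printing is handy for checking your work, but often you want to keep the results for later. One common pattern, sketched below, is to start with an empty list and add to it on each pass through the loop using the list's `append` method. The name `titled_names` is just an illustrative choice.

```python
ob_faculty_list = ["Gill Abrir", "Brian Bateman", "Andrea Traynor"]

# Start with an empty list to collect the results
titled_names = []

for ob_faculty in ob_faculty_list:
    # append() adds one new element to the end of the list
    titled_names.append("Dr. " + ob_faculty)

print(titled_names)
```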


# Functions

With everything up to this point, you could build almost any program imaginable. But it would be painful without functions - a way to collect processes into organized units.

Functions don't do anything that you haven't learned yet; they just make programming easier by reducing the amount of repetition in your code. They work similarly to functions in math - you have an input variable, some internal processing, and an output variable. In coding, you can have more than one input and more than one output. Let's make the world's most boring function - add "Dr. " to someone's name.

In [9]:
# you start defining a function with the keyword "def" 
# list the input variables within parentheses, followed by a colon

def medical_school(person_name):
    doctor_name = "Dr. " + person_name
    return doctor_name

# to run a function, use its name followed by the inputs in parentheses
some_guy = "Billy Bob Man"
foo = medical_school(some_guy)
print(foo)

Dr. Billy Bob Man
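As mentioned above, a function can take more than one input and hand back more than one output. Here is a minimal sketch of what that can look like. The function name `introduce` and the `specialty` input are hypothetical, chosen only for this example.

```python
# Two inputs are listed inside the parentheses, separated by a comma
def introduce(person_name, specialty):
    doctor_name = "Dr. " + person_name
    introduction = doctor_name + " works in " + specialty
    # Two outputs are returned together, separated by a comma
    return doctor_name, introduction

# Each output is stored in its own variable, in the same order
name, intro = introduce("Billy Bob Man", "anesthesia")
print(name)
print(intro)
```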


# Data analysis with Python

In [10]:
# Introduction to Pandas and Dataframes

import pandas as pd
import matplotlib.pyplot as plt

# Creating a simple dataframe
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
    'Age': [25, 30, 35, 28, 22],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Boston'],
    'Salary': [50000, 75000, 80000, 65000, 45000]
}

df = pd.DataFrame(data)

print("Our dataframe:")
print(df)

# Basic operations

# Accessing a column
print("\nAccessing the 'Age' column:")
print(df['Age'])

# Adding a new column
df['Experience'] = [3, 7, 10, 5, 1]
print("\nDataframe after adding 'Experience' column:")
print(df)

# Basic statistics
print("\nBasic statistics of numerical columns:")
print(df.describe())

# Filtering
print("\nEmployees older than 25:")
print(df[df['Age'] > 25])

# Sorting
print("\nDataframe sorted by Salary (descending):")
print(df.sort_values('Salary', ascending=False))

# Grouping and aggregation
print("\nAverage salary by city:")
print(df.groupby('City')['Salary'].mean())

# Simple visualization
plt.figure(figsize=(10,6))
plt.bar(df['Name'], df['Salary'])
plt.title('Salary Distribution')
plt.xlabel('Name')
plt.ylabel('Salary')
plt.show()

# Saving to CSV
df.to_csv('employee_data.csv', index=False)
print("\nDataframe saved to 'employee_data.csv'")

# Reading from CSV
df_from_csv = pd.read_csv('employee_data.csv')
print("\nDataframe read from CSV:")
print(df_from_csv)

ModuleNotFoundError: No module named 'pandas'
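The error above means that the pandas package is not installed in the environment that ran this cell, so none of the cell's code was executed. One way to check which packages are missing, sketched here using only Python's standard library, is shown below. The helper name `missing_packages` is made up for this example.

```python
import importlib.util

def missing_packages(package_names):
    """Return the names from package_names that cannot be imported here."""
    missing = []
    for package_name in package_names:
        # find_spec returns None when a top-level package is not installed
        if importlib.util.find_spec(package_name) is None:
            missing.append(package_name)
    return missing

# An empty list means both packages are available
print(missing_packages(["pandas", "matplotlib"]))
```

In a Jupyter notebook, missing packages can usually be installed by running `%pip install pandas matplotlib` in its own cell and then re-running the cell that failed. Whether installation is permitted on a shared machine depends on how it has been set up.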

We are going to go through a practice exercise with medical resident physicians on how to use ChatGPT Data Analyst. First, create me a data set that is 100 rows about teen obesity in the United States and interesting variables so we can practice data analysis. 

Clean and format this data if needed. 

Analyze this data set and give me some insights about obesity among teens.
 
Visualize these insights.

Create a pie chart for BMI category distribution among teens in the U.S.
 
Now, create a new synthetic data set with 12 rows including hospital's revenue, surgical volume, patient volume, weather, and other interesting variables by month. Clean and format this data. 

Create a line graph showing revenue for this specific year. Make this a dual axis line graph and add in surgical volume on the second y-axis. 

Give me links where I can download the cleaned data set as a .csv file and the Python code for the analysis.

