# Introduction to Python

Welcome to Python! In this notebook you'll learn some basic tenets of Python coding and how to use a Jupyter notebook. Make sure to read the instructions carefully and follow along.

### Chapters:
- Using a Jupyter Notebook
- Intro to Python Syntax
- Some Data Structures: lists, arrays, and dataframes
- Running a linear regression
- What is a Function?
- Common errors

# Using a Jupyter Notebook
There are two main components or "cells" in a Jupyter Notebook:
- Markdown cell
- Python/Code cell

## Markdown Cells
The cell you're reading right now is a **Markdown** cell. For the assignments, these cells will contain the instructions for running the notebook. We can also add nicely formatted equations in a **Markdown** cell:

$$y=mx+c$$

You **do not** need to know Markdown syntax; these cells will only be used for instructions. Any time you need to interact with a Markdown cell, you will be given explicit instructions.

## Python/Code Cell

Below we have a python or code cell:

In [None]:
# This is a python cell
print("Hello, World!")

These cells are **interactive**: this is where we write the code. At the top of the page there is a play button that says **Run**. If you click that button, it should print this message right below the cell:

"Hello, World!"

Notice the `print("Hello, World!")` line: this is telling python to output the message "Hello, World!"

I've added an empty cell below with some quotation marks. Try playing with the code and add whatever message you want **inside of the quotation marks**. Then run the cell to see if things work as expected.

In [None]:
# Add your message in the print statement below
print("REPLACE THIS WITH YOUR MESSAGE")
# Then run the python cell

There is also some basic syntax you will need to be aware of, such as comments. Most *major* instructions will be given in the Markdown cells, but some instructions will be given to you within the Python cells using comments. A comment is "non-code" text that is used to explain the code.

Comments come after a hash symbol: `#`

In [None]:
# This is a comment, this does not run any code and is only used for notes

Now you should be ready to run a Jupyter Notebook!

# Python syntax: Variables

Below we'll learn some very basic Python syntax, i.e. how we store information in Python and tell it what to do.

## Printing & Strings

The first thing we learned from above was how to get python to "print" a message.

This is done by calling the `print()` function and putting a *string* inside the parentheses. A *string* is surrounded by quotation marks `""` and can contain any form of text such as sentences, words, or numbers.

In [None]:
print("This will print a string")

Go ahead and run the cell above and look at the output

### Outputs

This brings us to outputs, which we've already encountered in the examples above. The output of Python code appears *below* the cell after running it. This is also where any errors will appear!

In [None]:
# Run this cell to see the output
print("This will output a string with some numbers: 1, 2, 3, 4, 5")

## Variables

We'll now learn about variables. Variables are very important in Python: they are how we store any sort of information.

Variables are created by defining a variable name and adding an equals sign after the name `=`. The data we want to store in the variable is written to the *right* of the `=`. 

For example, if I want to define a variable for the number *7* and label it *seven*, the syntax will be:

`seven = 7`

And in python: (go ahead and run the cell below)

In [None]:
# A variable for the number 7
seven = 7

Notice that when you run this code (hit the play button over the cell), there is no output! This is because variables are simply stored *in memory*. If we want the variable to be outputted, we need to write a print statement and put the variable name inside the parentheses.

In [None]:
# Here we print the variable seven
print(seven)

### Variables can be anything

Variables can hold any sort of data, not just numbers. They can also hold strings, which we learned about above.

In [None]:
# This is a variable storing a string
greeting = "Hello, World!"
# Now let's print the variable greeting
print(greeting)

There are some rules for naming variables:
- The name cannot *start* with a number, although numbers may appear later in the name: `mass_1` works, but `1_mass` does not
- There are no spaces in variable names. If you need a space, use an underscore instead: `double_word = "Two word variable"`
- There needs to be a variable name on the left and some sort of data on the *right* of the equals sign, otherwise the statement is incomplete: running `incomplete_variable =` will give you an error.
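
If you are ever unsure whether a name is allowed, one quick check (a small sketch, not something you need for the labs) is to ask Python itself. Every string has an `isidentifier()` method that reports whether the text follows the naming rules. The candidate names below are only examples.

```python
# Candidate variable names to check (the names themselves are only examples)
candidates = ["seven", "double_word", "double word", "2nd_mass"]

# Store True/False for each candidate name
results = {}
for name in candidates:
    # isidentifier() is True when the text follows Python's naming rules
    results[name] = name.isidentifier()
    print(name, results[name])
```

The names containing a space or starting with a number are reported as `False`, so they cannot be used as variable names.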

### Doing math using variables

Now let's get comfortable using variables. We'll use python to calculate the molar mass of some hydrocarbons. The first step here is to define variables with the atomic masses of carbon and hydrogen:

In [1]:
# Below are the variables for the atomic masses of Hydrogen and Carbon
mass_H = 1.0079
mass_C = 12.0107

Make sure to run the cell above!

Now let's find the mass for methane: $CH_4$

So we need to define variables for the number of carbons and hydrogens:

In [4]:
# Defining variables for the number of atoms of each element in a molecule of methane
num_H = 4
num_C = 1

Again, run this cell!

To find the molar mass, we will do some calculations! To tell python to add two variables we simply use a plus sign `+` and for multiplication, we use a star `*`. You'll notice that this is the exact same as in Excel!

We can also use parentheses, just like we would mathematically.

In [None]:
# Now to find the molar mass of methane
molar_mass = (mass_H*num_H) + (mass_C*num_C)
# Now let's print the molar mass
print(f'Molar mass of methane: {molar_mass} g/mol')

One thing you'll notice is that the format inside the print statement is not something we've seen before! This is because I'm using something known as an *f-string*. You **do not** need to know what an f-string is, but it's a very clean way to have both strings and variables in one print statement.
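
As an optional aside, here is a small sketch of what an f-string can do. Anything inside the curly braces `{}` is replaced by the value of that variable, and adding `:.2f` after the variable name rounds the printed number to two decimal places. The variable name `message` is just an illustration.

```python
# Atomic masses, same values as in the cells above
mass_H = 1.0079
mass_C = 12.0107

# Molar mass of methane: 4 hydrogens and 1 carbon
molar_mass = (mass_H * 4) + (mass_C * 1)

# {molar_mass:.2f} inserts the value rounded to two decimal places
message = f'Molar mass of methane: {molar_mass:.2f} g/mol'
print(message)
```

Rounding this way only changes how the number is displayed; the variable `molar_mass` still stores the full value.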

Now, your task is to find the molar mass of ethane: $C_2H_6$

Below, I've added all the variables, you need to complete the *right-hand side* definitions


In [None]:
# Here are the masses, make sure to define the variables
mass_H =
mass_C = 
# Now define the number of atoms of each element in a molecule of ethane
num_H = 
num_C = 

Now run this cell after defining the variables! ***Remember, we always need to run each cell, this stores the variable in python. If you miss running a cell, then python will not remember that you defined the variable.***

Below, I've added an incomplete statement for the molar mass. Use the syntax you learned above to add the computation for the molar mass. The answer should be approximately 30.07 g/mol.

In [None]:
# Below is the molar mass calculation for ethane
# remember we need to multiply the mass of each element by the number of atoms of that element in the molecule
# and then sum the results, same formula as for methane
molar_mass = 
# Now let's print the molar mass: no need to add anything to this line
print(f'Molar mass of ethane: {molar_mass} g/mol')

### Some extra math syntax
Here is all the math syntax you'll need. Most of it is the same as in Excel, except for exponents, which use `**` instead of `^`!
- Adding: `+`
- Subtracting: `-`
- Multiplication: `*`
- Division: `/`
- Exponent: `**`
- Parentheses: `()`

In [None]:
# Some examples of mathematical operations: feel free to play with the numbers
addition = 2 + 2
subtraction = 3 - 1
multiplication = 2 * 3
division = 8 / 4
exponentiation = 2 ** 3
# Add order of operations
order_of_operations = (2 + 2) * 4 / 2

# Now let's print the results
print(f'Addition: {addition}')
print(f'Subtraction: {subtraction}')
print(f'Multiplication: {multiplication}')
print(f'Division: {division}')
print(f'Exponentiation: {exponentiation}')
print(f'Order of Operations: {order_of_operations}')

# Data Structures: lists, arrays, and dataframes

## Lists

One data structure that we won't use often in these labs, but that you should know about, is the *list*. Lists are the basis of the more complicated structures we'll need.

A list is, as the name suggests, a list of variables, numbers, or strings! Lists are defined by placing *elements* inside square brackets `[]` and separating the elements with commas. So, if we wanted a list of the numbers 1-5, we would do this:

In [None]:
# Printing a list of numbers from 1 to 5
print([1, 2, 3, 4, 5])

When you run the cell you'll notice that it prints out the list exactly as we wrote it in the syntax.

A list can be stored in a variable and can hold any type of data, such as numbers or strings:

In [None]:
# Here is a list with random elements
random_list = [1, 'hello', 3.14, 'world', 5]
# Now let's print the list
print(random_list)

Now let's make two lists with some random numbers and see what happens when we try to add them.

In [None]:
# Two lists with random numbers of the same length
list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]

# Adding the two lists
add_list = list_1 + list_2

# Now let's print the added list
print(add_list)

Run the cell above, and what do you see?

The elements were not added mathematically! Instead, we see something else known as *concatenation*, where the elements of the lists were combined to form a new, longer list. This is why, to do math, we need a new type of data structure.

## Arrays

The best way to do math with lists is to convert them into arrays. The easiest way to picture an array is as a vector: one column of data with as many rows as needed. To create arrays we will need to import a module known as `numpy`.

Run the cell below, and what do you see?

In [None]:
# Let's import numpy
import numpy as np

# let's create two lists of numbers
list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]

# Now let's convert the lists to arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)

# Now let's add the two arrays
add_array = array_1 + array_2

# Now let's print the added array
print(add_array)


As expected, the data is now added properly! We can do this for any two arrays *as long as they are the same size*, and the other math operations work the same way.

In [None]:
# Now let's try doing other mathematical operations with the arrays
mult_array = array_1 * array_2
div_array = array_2 / array_1
exp_array = array_1 ** array_2

# Now let's print the results
print(f'Multiplication: {mult_array}')
print(f'Division: {div_array}')
print(f'Exponentiation: {exp_array}')

We can also do math with arrays and numbers or scalars.

In [None]:
# We can add a number to an array
add_num_array = array_1 + 5
# We can also multiply an array by a number
mult_num_array = array_1 * 2
# Let's square the array
squared_array = array_1 ** 2

# Now let's print the results
print(f'Addition: {add_num_array}')
print(f'Multiplication: {mult_num_array}')
print(f'Squared: {squared_array}')

We can also have multidimensional arrays. The best way to think about a 2D array is as a matrix. The examples above are one-dimensional arrays, so each one is a vector:

$$ \text{1D Array} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ \vdots \end{bmatrix} $$

So a 2D array is like a 2D matrix:

$$ \text{2D Array} = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ \vdots & \vdots \end{bmatrix} $$

In [None]:
# here's an example of a 2D numpy array
array_2D = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Now let's print the 2D array
print(array_2D)
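
The math rules from the 1D examples carry over to 2D arrays. As a short sketch, the cell below checks the size of the array using its `shape` attribute (number of rows, then number of columns) and multiplies every element by a number. The variable name `doubled_2D` is only an illustration.

```python
import numpy as np

# The same 3 x 3 array as in the cell above
array_2D = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# shape gives (number of rows, number of columns)
print(array_2D.shape)

# Multiplying by a number acts on every element, just like for 1D arrays
doubled_2D = array_2D * 2
print(doubled_2D)
```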

We will very rarely encounter 2D arrays, even though our datasets will often have more than one column. Instead, we'll opt for a data structure that makes analyzing multidimensional data easier: dataframes.

## Dataframes

Dataframes are more complicated: think of them as an Excel sheet that you control using Python. Like Excel sheets, they have column headings and the data is arranged in tabular form. The numbers also behave like a `numpy` array, so you can perform mathematical operations between columns.

For this tutorial we will be using a faux datasheet (these numbers are all false!), which is an Excel file located in the same folder as this Jupyter notebook. We will first open this data using a module known as `pandas`.

Run the cell below and see what happens

In [None]:
# Importing pandas
import pandas as pd

# Opening the excel file as a pandas dataframe and storing it in the variable dataframe
dataframe = pd.read_excel('emissions_data.xlsx', engine='openpyxl')

# Now let's print the first few rows of the dataframe
print(dataframe.head())

You should see a table of climate and energy data organized by year. With large datasets, we often need to pull out one or two rows or columns quickly for comparison, and this is where dataframes are very powerful. Unlike the `numpy` arrays we used above, which held only numbers, a dataframe comfortably mixes numbers with words/letters, allowing us to have *labelled column headings*. This lets us very easily extract a column using its label.

The syntax is as follows:

- First identify the variable name of your dataframe
- Copy the *exact* name of the column heading
- Combine the two like this: `dataframe_variable["Column Heading"]`, the column heading has to be in between quotes!

In [None]:
# Let's print out just the temperature column
print(dataframe["Temperature (°C)"])

We can also look at two columns at the same time by separating the column names by a comma and adding an extra square bracket:

`dataframe_variable[["Column 1", "Column 2"]]`

In [None]:
# Let's look at the year and emissions columns
print(dataframe[["Year", "CO2 Emissions (ppm)"]])

Again, we need to make sure that the column name is *exact*. If it's not exact we will get an error!

Now let's say I want to see all the data for the year 2000. The syntax for this is a little more complicated. We'll need to:
- Identify the *variable name* of the dataframe
- Identify the *column heading* that stores all the years (in this case it's "Year")
- Identify the *Year* we want to look at (2000)

Combining all three we get:
`dataframe_variable[dataframe_variable["Column Heading"] == Year]`

In [None]:
# Looking at data for a specific year
year_2000 = dataframe[dataframe["Year"] == 2000]

# Now let's print the data for the year 2000
print(year_2000)
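
To see why this syntax works, it can help to look at the two steps separately. The sketch below uses a tiny made-up table (the names `small_frame`, `"Value"`, and the numbers are placeholders, not the real dataset). The comparison on its own produces one `True` or `False` per row, and placing that result inside the square brackets keeps only the rows marked `True`.

```python
import pandas as pd

# A tiny made-up table; the values are placeholders, not real data
small_frame = pd.DataFrame({
    "Year": [1999, 2000, 2001],
    "Value": [10.0, 12.5, 11.0],
})

# Step 1: the comparison alone gives one True/False entry per row
mask = small_frame["Year"] == 2000
print(mask)

# Step 2: using the mask inside the square brackets keeps only the True rows
selected_rows = small_frame[mask]
print(selected_rows)
```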

We can also add two columns together if we want. I'll show an example, but remember that the result here is physically meaningless since each column has a different unit!

In [None]:
# Adding two columns together
adding = dataframe["CO2 Emissions (ppm)"] + dataframe["Energy Usage (TJ)"]

# Print the result
print(adding)

Dataframes are a very powerful data structure. They are the main data structure we'll be using, along with arrays.

# Running a linear regression

Now that we know some basic Python syntax and how to load Excel data into dataframes, we can take a practice look at how to do a linear regression.

I will go into more detail on the parts of the linear regression in the individual lab modules, but we can look at some simple examples here.

## Create a plot

The first thing we will need to do is plot our data to visualize it. Plotting in Python is done using a library known as `matplotlib`; we import its `pyplot` module under the shortened name `plt`. I will go into more detail about everything we can do to create nice plots in the Lab 8 Jupyter Notebook.

For now, we create a plot by calling `plt.plot` and having the x and y data in parentheses. Therefore, the general format for adding data to a plot looks like this:  `plt.plot(x, y)`

We will walk through it in the code below where we plot the Eutrophication vs the Year:

In [None]:
# Let's import the modules we need
import pandas as pd
import matplotlib.pyplot as plt

# Reading the data from the excel file
dataframe = pd.read_excel('emissions_data.xlsx', engine='openpyxl')

# Let's save the year and eutrophication columns in separate variables
year = dataframe["Year"]
eutrophication = dataframe["Eutrophication (kg P eq)"]

# Now let's plot the data
# So we call matplotlib to plot the year on the x-axis
# and eutrophication on the y-axis
plt.plot(year, eutrophication)

# We will also label the x and y axes
plt.xlabel("Year")
plt.ylabel("Eutrophication (kg P eq)")

# Let's show the plot
plt.tight_layout()
plt.show()

## Linear Regression

Now let's say I want to see how linear this trend is. To do this, we run a *linear regression* and test the goodness-of-fit. We will use a function called `linregress`, which comes from the `scipy` library. Running a linear regression is also simple: we just call `linregress` and specify the `x` and `y` data:
- `linregress(x, y)`

This will output a result that we can save as a variable `regression_results`:

In [None]:
# To run a linear regression, we need to import the linregress function from scipy
from scipy.stats import linregress

# Let's run the linear regression
regression_results = linregress(x=year, y=eutrophication)

# If this runs without errors it will print Regression Successful
print("Regression Successful")

To see exactly what happened, we will need to find the fitted slope and y-intercept, just like when we fit using Excel. The easiest way to do this is to call `regression_results.slope` and `regression_results.intercept`, as both of them are stored in that variable. Additionally, we can also get their errors!

We can also get the $r^2$ value by squaring `regression_results.rvalue`:

In [None]:
# Let's get the fitted slope and intercept
slope = regression_results.slope
intercept = regression_results.intercept

# And their errors
slope_error = regression_results.stderr
intercept_error = regression_results.intercept_stderr

# And the R squared value (rvalue is r, so we square it)
r_squared = regression_results.rvalue ** 2

# Let's print the results
print(f'Slope: {slope} ± {slope_error}')
print(f'Intercept: {intercept} ± {intercept_error}')
print(f'r^2: {r_squared}')
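
One way to build confidence in these numbers is to run `linregress` on data where the answer is already known. The sketch below uses made-up points that lie exactly on the line $y = 2x + 1$ (the names `x_check`, `y_check`, and `check` are only illustrations), so the fitted slope should come out as 2, the intercept as 1, and $r^2$ as 1, apart from tiny rounding differences.

```python
import numpy as np
from scipy.stats import linregress

# Made-up data that lies exactly on the line y = 2x + 1
x_check = np.array([0, 1, 2, 3, 4])
y_check = 2 * x_check + 1

# Run the regression on the known data
check = linregress(x_check, y_check)

# The slope, intercept and r^2 should match the line we built
print(f'Slope: {check.slope}')
print(f'Intercept: {check.intercept}')
print(f'r^2: {check.rvalue ** 2}')
```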

### Plot the regression
This is great! But we should put all of this into one plot to actually visualize how good the fit is. So we'll use the slope and intercept to find our fitted y-values and plot them on the same plot as the experimental data. There will be some extra arguments in the code below, but those will be explained in the individual lab modules when you need them.

In [None]:
# First let's find the y values of the linear regression
fitted_values = slope * year + intercept

# Let's plot the data and the linear regression
# The experimental data will be circles and the linear regression will be a line
plt.plot(year, eutrophication, 'o', label="Experimental Data", color='black')
plt.plot(year, fitted_values, label="Linear Regression", color='red')

# Labeling the x and y axes
plt.xlabel("Year")
plt.ylabel("Eutrophication (kg P eq)")

# Let's add a legend to identify the two lines
plt.legend()

# Let's show the plot
plt.tight_layout()
plt.show()

# What is a Function?

Functions are a powerful tool in Python, even if they seem complex at first. They are a block of reusable code that performs a specific task. Rather than writing the same code multiple times, you can "call" a function whenever you need to perform that task. This makes your code cleaner and more efficient.

You *will not* need to know how to implement and use functions on your own for this course. **However** you should know the general structure and components of a function.

Functions take inputs, perform operations, and return outputs. The general structure of a function is:

In [None]:
# Here is the basic structure of a function
def function_name(inputs):
    # Code block that performs some operations
    # for example, squaring the input
    output = inputs ** 2
    return output

## Example: Calculating Averages
Let’s explore an example where functions come in handy: calculating the average of a list of numbers. We'll start with three different lists of numbers.

In [None]:
# Make sure to run this cell!!
# Here are a couple of lists with numbers
list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8, 9, 10]
list_3 = [11, 12, 13, 14, 15]

We'll now manually calculate the average of each list. This is done by adding all the numbers in the list and then dividing by the number of items in the list (i.e. the "length" of the list)

- Adding a list is done by using `sum(list_name)`
- Finding the number of items is done by using `len(list_name)`

In [None]:
# Manually find the average of the lists
average_list_1 = sum(list_1) / len(list_1)
average_list_2 = sum(list_2) / len(list_2)
average_list_3 = sum(list_3) / len(list_3)

# print the averages
print(f'Averages: {average_list_1}, {average_list_2}, {average_list_3}')

Although this isn't too bad, it can get tedious if we have a large number of lists. Or, imagine if you had to do something more complicated than find the average. What we can do instead is define a function that calculates the average since the formula is the same for all of them!

In [None]:
# Let's make a function that calculates the average of a list
# The input will be the list of numbers
def average_list(input_list):
    # The sum of the list divided by the length of the list
    # This is the formula for the average
    average = sum(input_list) / len(input_list)
    # returning the output
    # which is the average of the list
    return average

So now, when we want to calculate the average, all we have to do is call the function and store the output in a new variable!

In [None]:
# Calculate the averages of the lists using the function
average_1 = average_list(list_1)
average_2 = average_list(list_2)
average_3 = average_list(list_3)

# Print the averages
print(f'Averages: {average_1}, {average_2}, {average_3}')

**What you should know:**
- A function takes in inputs
- The function will perform calculations on the inputs and "return" an output
- For this course: functions will be used to define specific formulae. You should know how to use your inputs to write the correct formula inside of a function!
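
As a sketch of what "writing a formula inside a function" can look like, the cell below wraps the molar mass formula from earlier in a function. The function name `hydrocarbon_molar_mass` and its inputs are hypothetical choices for this illustration; the atomic masses are the same values used above.

```python
# Atomic masses, same values as earlier in the notebook
mass_H = 1.0079
mass_C = 12.0107

# The inputs are the number of carbon and hydrogen atoms in the molecule
def hydrocarbon_molar_mass(num_C, num_H):
    # Multiply each atomic mass by the number of atoms, then add them up
    molar_mass = (mass_C * num_C) + (mass_H * num_H)
    # Return the result so it can be stored in a variable
    return molar_mass

# Methane is CH4 and propane is C3H8
methane = hydrocarbon_molar_mass(1, 4)
propane = hydrocarbon_molar_mass(3, 8)
print(f'Methane: {methane} g/mol, Propane: {propane} g/mol')
```

The formula is written once, and only the inputs change from molecule to molecule.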

# Common Errors

## Wrong filename or missing extension
In one of the previous sections we used the "emissions_data.xlsx" file. Now let's try loading that file, but without the ".xlsx":

In [None]:
import pandas as pd
# Let's load in the emission data
dataframe = pd.read_excel('emissions_data', engine='openpyxl')

We get a `FileNotFoundError`! Therefore, when loading files, make sure that your filenames are **exact** and always **include the file extension**, *e.g.* .xlsx, .csv, etc.
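
If you want to check a filename before loading it, one possible approach (a sketch only, you will not need to write this yourself) uses Python's built-in `pathlib` module. The helper function `check_filename` is a hypothetical name made up for this example; it reports whether the name has an extension and whether the file exists in the current folder.

```python
from pathlib import Path

# Hypothetical helper: describes what is wrong (if anything) with a filename
def check_filename(name):
    path = Path(name)
    # suffix is the file extension, e.g. ".xlsx"; it is empty if there is none
    if path.suffix == "":
        return "no file extension"
    # exists() is True only if the file is really there
    if not path.exists():
        return "file not found in this folder"
    return "file found"

print(check_filename("emissions_data"))
print(check_filename("emissions_data.xlsx"))
```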

## Arrays of different lengths
So, what happens when we try to add two arrays that are *different lengths*?

Run the cell below and see what happens

In [None]:
# Let's make two lists of different lengths
list_1 = [1, 2, 3, 4, 5]
list_2 = [6, 7, 8]

# Now let's convert the lists to arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)

# Now let's add the two arrays: what do you expect to happen?
add_array = array_1 + array_2

# Now let's print the added array
print(add_array)

Error message! What we'll focus on is this part:

`ValueError: operands could not be broadcast together with shapes (5,) (3,)`

This means that the addition could not be done since the lengths of the arrays are different.

***Fix:*** Make sure arrays are the same length before performing any operations on them.
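
One way to catch this before the error appears is to compare the `shape` of the two arrays first. The cell below is a minimal sketch of that check; the variable name `result` is only an illustration.

```python
import numpy as np

# Two arrays of different lengths, as in the cell above
array_1 = np.array([1, 2, 3, 4, 5])
array_2 = np.array([6, 7, 8])

# Only add the arrays if their shapes match
if array_1.shape == array_2.shape:
    result = array_1 + array_2
    print(result)
else:
    result = None
    print(f'Cannot add: shapes are {array_1.shape} and {array_2.shape}')
```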

## Wrong Column Name

We'll use the same dataset we used above. Let's try to call the temperature column, but with the name slightly wrong, and see what happens:

In [None]:
# Calling the temperature column from the dataframe above
print(dataframe["Temperature"])

We get a **Key Error**! 

`KeyError: 'Temperature'`

This is because the heading is *Temperature (°C)* **not** *Temperature*. We need to be **exact** with our column headings when calling them.
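
A quick way to avoid this error is to print the column headings and copy the exact name from there. The sketch below uses a small made-up dataframe (the name `example_frame` and the numbers are placeholders, not the real dataset) to show how to list the headings and how to test whether a particular name is one of them.

```python
import pandas as pd

# A small made-up dataframe; the values are placeholders, not real data
example_frame = pd.DataFrame({
    "Year": [2000, 2001],
    "Temperature (°C)": [14.2, 14.3],
})

# List every column heading exactly as pandas stores it
print(list(example_frame.columns))

# Check whether a name is a valid column heading
print("Temperature" in example_frame.columns)
print("Temperature (°C)" in example_frame.columns)
```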

## Wrong Syntax
The most common issue is wrong syntax! This means that Python does not recognize the code that was written. For example, this can be due to an incomplete variable:

In [None]:
variable =

It can also be due to a missing comma when writing a list:

In [None]:
my_list = [1, 2 3]

Make sure to follow the instructions carefully, and you will avoid most syntax errors!