Advanced Python Workshop
========
***

Welcome!
-----

This workshop is intended to give you a background in the most commonly-used Python libraries and operations. Here's an outline of the topics we'll cover today:

1) NumPy Tutorial  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a) Array Creation  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b) Shape Manipulation  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;c) Important Array Operations  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;d) Data Retrieval  
2) Pandas Tutorial  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a) ```Series``` Creation and Manipulation  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;b) ```DataFrame``` Creation and Manipulation  
3) Basic Data Retrieval from Files  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a) CSV Manipulation

Necessary Libraries
-----------------
Two of the most-used libraries in Python are ```numpy``` and ```pandas```. Both provide more advanced data structures than Python's built-in ones, and they appear in most real-world data analysis and machine learning code. You will rarely see such projects written in plain Python without any libraries.  

If you haven't imported libraries in Python before, you can think of it as similar to using a static class in Java (e.g. ```Math```). When you import a library ```as``` something, you are telling Python the name by which you will reference the library.

In [None]:
import numpy as np #if you import numpy as anything other than np, you will be ostracized from the python community
import pandas as pd

NumPy Tutorial
---
NumPy is a widely-used math and scientific library in Python. Most importantly, it gives us a new data structure that allows us to manipulate data much more easily, and it is used extensively in neural networks and other machine learning techniques.

### Array Creation
The main structure that NumPy gives us is its powerful ```ndarray``` structure. This structure goes well beyond Python's built-in lists, adding functionality to reshape multidimensional arrays and to perform fast, higher-level math operations.

Let's begin by creating a 1-dimensional array of 10 integers, 0-9.

In [None]:
arr = np.arange(10) #N.B. the function name is "arange" not "arrange"
print(arr)
print(type(arr))

As we can see above, we have created an object of type ```numpy.ndarray``` rather than a standard Python list. You can also create NumPy arrays out of standard lists:

In [None]:
standardArray = [0,1,2,3,4,5,6,7,8,9]
print(type(standardArray))
numpyArray = np.array(standardArray)
print(type(numpyArray))
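
The same conversion works for nested lists. Here is a small sketch, using made-up data and hypothetical variable names, of how a list of rows becomes a 2-D array and how NumPy picks a single element type for the whole array:

```python
import numpy as np

# A nested list (a list of rows) becomes a 2-D array.
nested_list = [[1, 2, 3], [4, 5, 6]]  # hypothetical example data
grid = np.array(nested_list)

print(grid)
print(grid.ndim)   # number of dimensions
print(grid.dtype)  # element type NumPy inferred (an integer type here)

# Mixing ints and floats promotes every element to float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```

Unlike a Python list, an ```ndarray``` stores elements of one type, which is part of what makes its math operations fast.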

### ndarray Shape
Now we'll go over what you can do with NumPy arrays. First, we need to define shape. Shape refers to the dimensions of a multidimensional array. For example, let's say I have a 2-D array as follows:

\[\[a b c d\]  
 \[e f g h\]\]
 
We would say that this 2-D array has a shape of (2,4). NumPy allows us to determine the shape of a NumPy array by using the ```shape()``` function, and we can also use the ```reshape()``` function to re-dimension arrays. Let's use these functions to find the initial shape of ```numpyArray``` and reshape it so that it has two rows and five columns.

In [None]:
print("Initial array: " + str(numpyArray) + "\n")
print("Shape of initial array: " + str(np.shape(numpyArray)) + "\n")
numpyArray = numpyArray.reshape(2,5)
print("Final array: ")
print(str(numpyArray) + "\n")
print("Shape of final array: " + str(np.shape(numpyArray)) + "\n")
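
Two related conveniences are worth knowing. The sketch below, which uses a fresh array named ```values``` (a hypothetical name) so it can run on its own, reads the shape as an attribute and lets NumPy infer one dimension by passing ```-1```:

```python
import numpy as np

values = np.arange(10)

# The shape is also available as an attribute (no parentheses).
print(values.shape)

# -1 asks NumPy to infer that dimension from the total element count.
two_rows = values.reshape(2, -1)
print(two_rows.shape)

# reshape returns a new array object; the original keeps its shape.
print(values.shape)
```

Because ```reshape``` does not change the array it is called on, the cell above assigns the result back to ```numpyArray```.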

Let's retrieve some data from this array. Remember that, like in Java, Python goes row, then column. Additionally, Python is generally ***zero-indexed***, meaning that the first element in an array is treated as the "zeroth" element. Thus, to get the value of the element from the second row and third column, we say:

In [None]:
print(numpyArray[1][2])
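
NumPy also accepts both indices inside a single pair of brackets, separated by a comma. The following sketch rebuilds an equivalent 2 x 5 array under the hypothetical name ```grid``` and compares the two styles:

```python
import numpy as np

grid = np.arange(10).reshape(2, 5)

chained = grid[1][2]  # selects row 1 first, then element 2 of that row
direct = grid[1, 2]   # selects row 1, column 2 in a single step
print(chained, direct)

# A colon means "everything along this axis".
print(grid[1, :])  # the whole second row
print(grid[:, 2])  # the whole third column
```

Both styles return the same element; the comma form is the one you will usually see in NumPy code.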

You may have noticed that we chose to reshape the array to two rows and five columns because 2\*5 = 10, the number of elements in the array. Let's see what happens when we try to reshape the array to something that doesn't match the number of elements:

In [None]:
numpyArray = numpyArray.reshape(2,6)
print(numpyArray)

Whoops, we get a ```ValueError```! Remember, the new shape must hold exactly as many elements as the array contains. Even if the new dimensions could hold more values than necessary, you will get an error.
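
If you would rather handle a mismatched reshape than let it stop your notebook, one option is to catch the exception. This is only a sketch, using a hypothetical array named ```values```:

```python
import numpy as np

values = np.arange(10)

try:
    values = values.reshape(2, 6)  # 2 * 6 = 12, but there are only 10 elements
except ValueError as error:
    print("Reshape failed:", error)

# The failed call did not modify the array.
print(values.shape)
```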

### Other Important Array Operations

Now that we have introduced the concept of shape, everything else about arrays in NumPy is pretty straightforward. 

Most methods for iteration over NumPy arrays are the same as those for standard Python lists. Let's create a new NumPy array of digits 0 to 4 and print them using a for loop.

In [None]:
basicArray = np.arange(5) #arange(5) yields 0, 1, 2, 3, 4 (the stop value is excluded)
for i in basicArray:
    print(i)

One function that is often used is ```zeros()```. This function is very simple: it creates an array filled with zeros given a shape. (There is also another similar function called ```ones()```.) Let's make a 2 x 3 array filled with zeros:

In [None]:
exampleArray = np.zeros((2,3))
print(exampleArray)
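
For comparison, here is a short sketch of ```ones()``` along with two variations: an explicit element type via the ```dtype``` argument, and ```np.full()``` for an arbitrary fill value. The variable names are illustrative only:

```python
import numpy as np

ones_array = np.ones((2, 3))  # same idea as zeros(), filled with 1.0
print(ones_array)

int_zeros = np.zeros((2, 3), dtype=int)  # integer zeros instead of floats
print(int_zeros)

sevens = np.full((2, 3), 7)  # any constant fill value
print(sevens)
```

Note that the shape is passed as a single tuple, which is why the calls have double parentheses.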

Let's say I wanted to increment all the values in ```exampleArray``` by 3. We *could* iterate over the array and add 3 to each value, but instead we should take advantage of the fact that NumPy has ***element-wise operations***. This means that if I apply an arithmetic operation to the array, the operation is applied to each element, and the shape of the array is retained.

In [None]:
exampleArray = exampleArray + 3
print(exampleArray)

The same property applies to other arithmetic operations, including subtraction, multiplication, division, and exponentiation.

In [None]:
exampleArray *= 2
print(exampleArray)
exampleArray **= 2
print(exampleArray)
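
Element-wise operations also work between two arrays of the same shape, not just between an array and a single number. The sketch below uses made-up data and shows the equivalent explicit loop for comparison:

```python
import numpy as np

prices = np.array([2.0, 4.0, 6.0])      # hypothetical data
quantities = np.array([1.0, 2.0, 3.0])  # hypothetical data

# Arrays of the same shape are combined position by position.
totals = prices * quantities
print(totals)

# The equivalent explicit loop, for comparison.
loop_totals = np.zeros(3)
for i in range(3):
    loop_totals[i] = prices[i] * quantities[i]
print(loop_totals)
```

The one-line version is easier to read and, on large arrays, typically much faster than the loop.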

**Sub-Array Data Retrieval**: One interesting feature of Python is the ability to gather sections of data from arrays, which is known as ***slicing***. For example, let's say I have a single-dimension array of 5 random values between 0 and 10, ```singleDimension```:

In [None]:
singleDimension = np.random.rand(5)*10
print(singleDimension)

I can retrieve the sub-array that starts at index 2 and stops before index 4 as follows:

In [None]:
print(singleDimension[2:4])

Notice that the first index is inclusive and the second is exclusive. You'll see this notation frequently in the future. Another thing to note is that when you want a sub-array that begins at index 0, you can actually just leave the "0" out and it'll work fine:

In [None]:
print(singleDimension[:3])
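
A few more slicing forms follow the same rules. This sketch uses a small array of letters (hypothetical data) so the results are easy to follow, and then applies a slice to the columns of a 2-D array:

```python
import numpy as np

letters = np.array(["a", "b", "c", "d", "e"])

print(letters[2:])   # from index 2 to the end
print(letters[-2:])  # the last two elements
print(letters[::2])  # every second element

grid = np.arange(10).reshape(2, 5)
print(grid[:, 1:3])  # all rows, columns 1 and 2
```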

## Pandas Tutorial

Now that you have an understanding of the NumPy library, we can start using the Pandas library, which builds on NumPy constructs. Much like NumPy offers its primary data structure ```ndarray```, Pandas offers two data structures called ```Series``` and ```DataFrame```. You'll find that ```Series``` is much like NumPy's array, and ```DataFrame``` allows you to observe data in a table-like manner. For those of you who have used the R programming language, ```DataFrame``` is similar to R's data frame.

### Series

As mentioned above, ```Series``` is very similar to ```ndarray```. Essentially, a ```Series``` is just a labelled array. You can even create a ```Series``` out of an ```ndarray```:

In [None]:
apcsDiagnosticScores = np.random.rand(6)*50 + 10 #We take advantage of element-wise operations to generate random numbers between 10 and 60
diagnosticScoreAssignments = pd.Series(apcsDiagnosticScores, index=["Alex", "Alice", "Bob", "Katie", "Rajesh", "Ryan"])
print(diagnosticScoreAssignments)

As we saw earlier with ```ndarray```s, ```Series``` also support element-wise operations. For example, let's say we wanted to add a 40-point curve given the ~~abysmal~~ less-than-expected scores:

In [None]:
diagnosticScoreAssignmentsWithCurve = diagnosticScoreAssignments + 40
print(diagnosticScoreAssignmentsWithCurve)
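
Adding a fixed number of points is not the same as scaling by a percentage. The sketch below, with made-up scores and hypothetical variable names, contrasts the two; both are element-wise operations and both keep the labels:

```python
import pandas as pd

raw_scores = pd.Series([20.0, 50.0], index=["Alice", "Bob"])  # hypothetical data

plus_points = raw_scores + 40     # every score goes up by 40 points
plus_percent = raw_scores * 1.4   # every score goes up by 40 percent

print(plus_points)
print(plus_percent)
```

With a percentage curve, higher raw scores gain more points than lower ones, so the choice affects how the scores spread out.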

In order to retrieve data from ```Series```, we use notation similar to arrays: square brackets. You can retrieve data based either on the identifier (in this case the name of the person to whom the score is attributed) or the position of the person. Take Rajesh for instance. As you can see here, you can retrieve his score by either using his name or his position, 4, since this structure is zero-indexed and he is the fifth entry.

In [None]:
print(diagnosticScoreAssignmentsWithCurve['Rajesh'])
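print(diagnosticScoreAssignmentsWithCurve.iloc[4]) #.iloc selects by position; a bare [4] on a string-labelled Series is deprecated in recent pandas versions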
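
Pandas provides two explicit accessors that make it clear which kind of lookup you intend: ```.loc``` for labels and ```.iloc``` for positions. Here is a minimal sketch with made-up scores:

```python
import pandas as pd

scores = pd.Series([71.0, 88.5, 93.0], index=["Alex", "Alice", "Bob"])  # hypothetical data

by_label = scores.loc["Alice"]  # look up by index label
by_position = scores.iloc[1]    # look up by zero-based position
print(by_label, by_position)

# Labels and positions can be translated in either direction.
print(scores.index[1])                # label at position 1
print(scores.index.get_loc("Alice"))  # position of the label "Alice"
```

Using the accessors avoids ambiguity, especially when the labels themselves are integers.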

It's also important to note that in ```Series```, you can retrieve sub-sections of data (slices) in the same way you can with ```ndarray```s. In fact, you can even use the string identifiers rather than numeric indices; however, unlike using numeric indices, the last value is included. For example:

In [None]:
print(diagnosticScoreAssignmentsWithCurve["Alice":"Rajesh"]) #Rajesh's score is included because we are using the identifier, not the numeric index
print("\n")
print(diagnosticScoreAssignmentsWithCurve.iloc[1:4]) #Rajesh's score is excluded because we are using numeric positions
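
The inclusive and exclusive behaviors are easiest to compare on fixed data. This sketch uses made-up scores and the explicit ```.loc``` and ```.iloc``` accessors:

```python
import pandas as pd

scores = pd.Series(
    [10, 20, 30, 40, 50],
    index=["Alex", "Alice", "Bob", "Katie", "Rajesh"],
)  # hypothetical data

label_slice = scores.loc["Alice":"Katie"]  # label slice: the end label is included
position_slice = scores.iloc[1:3]          # position slice: the end position is excluded

print(label_slice)
print(position_slice)
```

The label slice returns Alice, Bob, and Katie, while the position slice returns only Alice and Bob.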

### DataFrame

A ```DataFrame``` is essentially a table that consists of ```Series```. When multiple ```Series``` have the same indices, they can be combined into a table. In order to do this, we need to create a dictionary, which is similar to a Map construct in Java if you've used that before. It's best to see an example:

In [None]:
scoreDictionary = {"Without Curve" : diagnosticScoreAssignments, "With Curve" : diagnosticScoreAssignmentsWithCurve}
scores = pd.DataFrame(scoreDictionary)
print(scores)

The structure of a ```DataFrame``` is essentially a dictionary (or map, if you prefer to think of it like that) of ```Series```, so if we want to retrieve a specific value, we first enter the ```Series``` we want, then we enter the specific element identifier within that series. For example, let's say we wanted to know Katie's uncurved score:

In [None]:
print(scores["Without Curve"]["Katie"])
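
The same lookup can be written in one step with ```.loc```, giving the row label first and the column label second. The sketch below builds a small stand-in table with made-up numbers so it can run on its own:

```python
import pandas as pd

scores = pd.DataFrame(
    {"Without Curve": [35.0, 52.5], "With Curve": [75.0, 92.5]},
    index=["Katie", "Ryan"],
)  # hypothetical data

katie_raw = scores.loc["Katie", "Without Curve"]  # row label, then column label
print(katie_raw)

print(scores.loc["Katie"])   # a whole row, returned as a Series
print(scores["With Curve"])  # a whole column, returned as a Series
```

Note the order: plain square brackets on a ```DataFrame``` select a column, while ```.loc``` starts with the row.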

## Introductory CSV Manipulation Tutorial

### Reading CSV Files

During our first Wednesday Workshop, we asked you to create a folder on your desktop called "AIClub," which is where your Jupyter notebooks should be placed. Please download [this iris csv file](https://drive.google.com/file/d/1xk2ZBOO0j7dI-bhOg5AQ0nBmHjFXz970/view?usp=sharing) and place it in your AIClub folder. Open it with a plain text editor (TextEdit on Mac, Notepad on Windows) to get an idea of its structure. This is the type of structure you'll see when manipulating datasets in machine learning. This particular dataset documents characteristics of types of flowers and is almost the "hello, world" of the machine learning world.

CSV files are comma-delimited data-storage files. The pandas library conveniently has functionality that allows us to read data from CSVs directly into ```DataFrames```:

In [None]:
data = pd.read_csv("iris.csv")
print(data)

As you can see, pandas makes it easy for us to read CSV data! As long as it's formatted correctly, we can convert it directly to a ```DataFrame```. This type of functionality is why many people prefer Python for data analysis and machine learning projects.
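
If you want to experiment without a file on disk, ```read_csv``` also accepts a file-like object. The sketch below reads a few lines of CSV text from memory; the column names and values are invented for illustration and are not necessarily those in the iris file:

```python
import io
import pandas as pd

# Hypothetical CSV content; the first line holds the column names.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))           # first two rows
print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # column names taken from the header line
```

For a large dataset, ```head()``` is usually a more practical first look than printing the whole table.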

### Writing to CSV Files

Writing to CSV files is about as easy as reading them. Let's use our diagnostic scores example from earlier. What if I wanted to write the data to a CSV file? Well, it takes a single method call:

In [None]:
scores.to_csv("scores.csv") #Will place output CSV in AIClub folder (or wherever your notebook is located)

Go ahead and make sure that you can now see a ```scores.csv``` file in your AIClub folder. Open it with TextEdit/Notepad to check it out!
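
To check that nothing is lost along the way, you can read a written file straight back in. This sketch uses a temporary folder and a hypothetical file name so it does not touch your AIClub folder; ```index_col=0``` tells pandas that the first column holds the row labels:

```python
import os
import tempfile
import pandas as pd

scores = pd.DataFrame(
    {"Without Curve": [35.0, 52.5]},
    index=["Katie", "Ryan"],
)  # hypothetical data

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scores_example.csv")  # hypothetical file name
    scores.to_csv(path)                         # row labels are written as the first column
    restored = pd.read_csv(path, index_col=0)   # use that first column as the index again

print(restored)
```

Without ```index_col=0```, the names would come back as an ordinary unnamed column and the rows would be labelled 0, 1, and so on.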