<a href="https://colab.research.google.com/github/connorgrannis/nch_python_workshop/blob/master/Week2_dataframes_file_navigation_excell_spss.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing Packages**
Similar to other languages (such as R), you can install and load packages for increased functionality.  While some packages are installed with Python, you'll need to manually install others.  There are two main ways of doing this: pip and conda.  pip is more-or-less universal and will work in different coding environments, while conda will only work if you're using Anaconda (or Miniconda).  The difference is essentially where you're downloading the package from: pip pulls from the Python Package Index (PyPI), while conda pulls from Anaconda's repositories.  If you're working in Anaconda, it's best to use conda if you can; otherwise, use pip.


Note: We're going to be using "packages" and "modules" interchangeably.

Let's make sure that the packages we're going to be working with today are installed.  Since we're working in a Google environment focused on learning, these are probably already installed but might not be if you are working outside of Google.

Usually, you'll run the pip/conda install command on the command line/terminal, but Google Colab lets you execute command-line statements by prefacing your command with `!`


```
!pip install numpy pandas
```
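
If you'd like to confirm from inside Python that the packages are available, here is a minimal sketch using the standard library's `importlib.util.find_spec`. The `packages` list and the `installed` dictionary are names chosen for this illustration and are not part of the workshop.

```python
from importlib.util import find_spec

# Packages this notebook relies on
packages = ['numpy', 'pandas']

# find_spec returns None when a top-level package cannot be found
installed = {name: find_spec(name) is not None for name in packages}

for name, is_installed in installed.items():
  status = 'installed' if is_installed else f'missing (try: pip install {name})'
  print(f'{name}: {status}')
```

This only checks that the package can be found; it does not import it or check its version.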



# **NumPy**

You can think of numpy as the underlying structure of most of the data you're used to. 

At its core, numpy works with arrays. If you're familiar with Matlab, think of this as the package that brings Matlab-esque functionality to Python.  If you're not familiar with arrays, you can think of them as a set of data.  Arrays can have multiple dimensions, which can be confusing when you're just starting out. Thinking about Excel, each row is an array, each column is an array, and you can even think of the entire dataset as an array.

Before we can use numpy, we have to load it.  We only need to do this once per notebook (per session) and it will work in all cells.  Python loads packages using the import command.  Numpy is often given the nickname "np" which is why we write 'import numpy **as** np' (although you can name it whatever you want) to save keystrokes.  This is the most common way to load numpy:


```
import numpy as np
```



Let's play around with numpy arrays so you can get used to it.

In [0]:
import numpy as np

x = [5,4,3,2,1]  # List of numbers
y = np.array(x)  # Using numpy to convert the list to an array
print(type(x))   # Show x's type (a list)
print(type(y))   # Show y's type (a numpy array)

Notice above in line 4 that the command np.array() converts a list to an array.  If we forget to make the input a list, we'll get an error.  For example, the following will not work:



```
y = np.array(5,4,3,2,1)
```
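
To see the difference safely, here is a small sketch that wraps the incorrect call in a `try`/`except` (a tool covered later in this notebook) and then makes the correct call with a list. The exact error type and message depend on your NumPy version, so the sketch catches both `TypeError` and `ValueError`; that choice is an assumption made for this example.

```python
import numpy as np

try:
  bad = np.array(5, 4, 3, 2, 1)    # numbers passed as separate arguments
except (TypeError, ValueError) as error:
  print(f'That did not work: {error}')

good = np.array([5, 4, 3, 2, 1])   # numbers wrapped in a list
print(good)
```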



## **How are arrays different than lists?**
While lists and arrays look similar, they behave differently.

In [0]:
# what do they look like?
print(x)
print(y)

What happens if we try to do multiplication and division with lists and arrays?

In [0]:
print(x*3)  # will replicate the list 3 times
print(y*3)  # will multiply each item in the array by 3

In [0]:
# print(x/3) # will raise a TypeError. Remove the first pound sign to run it and see the error
print(y/3)  # will divide each item in the array by 3

We can't multiply or divide lists by each other, but we can multiply arrays by other arrays (as long as they're the same length).  The multiplication happens element by element.

In [0]:
# print(x*x)  # will return an error
print(y*y)
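
As a hedged illustration of why this matters: to get the same element-by-element result from plain lists you would need a loop or list comprehension, and arrays whose lengths don't match can't be multiplied together. The names `a`, `b`, and `short` below are made up for this sketch.

```python
import numpy as np

a = [5, 4, 3, 2, 1]
b = np.array(a)

# With a list, element-wise math needs an explicit loop
tripled_list = [value * 3 for value in a]
# With an array, the same operation is a single expression
tripled_array = b * 3
print(tripled_list)
print(tripled_array)

# Arrays of different lengths cannot be multiplied element by element
short = np.array([1, 2])
try:
  print(b * short)
except ValueError as error:
  print(f'Shapes do not match: {error}')
```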

## **Built-in methods for arrays**
Arrays come with a bunch of useful methods to compute common statistics.  For example:
- mean
- max/min
- sum
- standard deviation
- variance

As well as other useful attributes (`shape`, `size`) and methods (`flatten`) that we use all the time:
- shape
- size
- flatten
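
Here is a short sketch of what these look like in use, on a small made-up array called `scores` (not part of the workshop data). Note that `shape` and `size` are attributes, so they are written without parentheses, while the others are methods that you call.

```python
import numpy as np

# A small 2-dimensional array (2 rows, 3 columns) made up for this example
scores = np.array([[1, 2, 3],
                   [4, 5, 6]])

print(scores.mean())     # average of all six values
print(scores.max())      # largest value
print(scores.min())      # smallest value
print(scores.sum())      # total of all values
print(scores.std())      # standard deviation
print(scores.var())      # variance

print(scores.shape)      # (rows, columns)
print(scores.size)       # total number of elements
print(scores.flatten())  # all values as a 1-dimensional array
```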

Your turn!
Complete the code block below to print each statistic for `y`.

In [0]:
print(f'The mean of y is: {}')
print(f'The sum of y is: {}')
print(f'The maximum value in y is: {}')
print(f'The variance of y is: {}')
print(f'The standard deviation for y is: {}')

## **Creating placeholders and constants**
Sometimes we might need to create a 1-dimensional array of either zeros or ones, and numpy makes it really easy

In [0]:
zeros = np.zeros(5)
ones = np.ones(5)

print(zeros, ones)
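
The same idea extends a little further. As a sketch, passing a tuple to `np.zeros` gives a multi-row placeholder, and `np.full` fills an array with any constant you choose. The variable names and the value `7.0` are arbitrary choices for this example.

```python
import numpy as np

grid_of_zeros = np.zeros((2, 3))   # 2 rows and 3 columns, all 0.0
sevens = np.full(5, 7.0)           # five copies of the constant 7.0

print(grid_of_zeros)
print(sevens)
```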

## **Multidimensional arrays**
So far, we've been looking at one-dimensional arrays.  Our behavioral data is often 2-dimensional arrays, although 3- or 4-dimensional arrays will be common when working with Big Data sets or neuroimaging data.

Let's create an array of animals.  We'll have a column for species, name, number of legs, age, and weight.  We'll need to start with two sets of brackets, because each item in our array is going to be a new entry.

In [0]:
my_pets = np.array([['Dog', 'Moose', 4, 4, 20], 
                    ['Dog', 'Dobby', 2, 3, 67],
                    ['Cat', 'Meow', 4, 7, 13],
                    ['Bird', 'Iago', 2, 9, 3],
                    ['Spider', 'Bob', 8, 1,1]])
print(my_pets)

In [0]:
# let's look at the shape
print(my_pets.shape)

Here, we can see that there are 5 rows and 5 columns.  We have two axes (rows and columns), so it's a 2-dimensional array.
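
You can also ask the array directly how many dimensions it has. One thing to be aware of is that a NumPy array stores a single data type, so mixing text and numbers turns everything into text. The sketch below shows both points on a tiny made-up array called `mixed`.

```python
import numpy as np

mixed = np.array([['Dog', 4], ['Bird', 2]])

print(mixed.ndim)          # number of axes (dimensions)
print(mixed.dtype)         # a Unicode string type, because text and numbers were mixed
print(type(mixed[0, 1]))   # the number was stored as a NumPy string, not an int
```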

### **Indexing arrays**
We can look at just certain parts of our array by using indexing and slicing, like we learned last week.

In [0]:
# Let's look at just the first row
moose = my_pets[0]
print(moose)

We can use the same indexing and slicing tools that we used with strings last week.  Here's a brief refresher working with `moose`

In [0]:
print(f'The second element of moose is: {moose[1]}')  # remember python indexing starts at 0!
print(f'The last element of moose is: {moose[-1]}')   # remember that the negative means to start at the end and move backwards
print(f'The middle three elements of moose are: {moose[1:-1]}') # remember that slicing does not include the 'stop' position
print(f'These are all the elements of moose: {moose[:]}')       # remember that leaving the 'start' and 'stop' fields blank when slicing will start at the beginning and stop at the end.

### **Indexing Multidimensional arrays**
Since `moose` is just one of our rows, how do we reference the other dimensions?

If we break down our variable `moose`, it becomes clear how it works.  Remember that `moose = my_pets[0]`, so getting the second element of moose could also be written like this: 

```
my_pets[0][1]
```

You can think of this as:

```
array[row][column]
```

which is the same as:

```
array[row, column]
```

In [0]:
print(my_pets[0][1])
print(my_pets[0,1])
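
The `array[row, column]` form also accepts slices, which lets you pull out a whole row, a whole column, or a block. A minimal sketch on a made-up array called `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid[1, :])      # every column of the second row
print(grid[:, 1])      # every row of the second column
print(grid[0:2, 0:2])  # first two rows and first two columns
```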

### Your turn!

Loop through the array and print the name and species for each animal.

Your output should be a sentence that looks like this:

```
Moose is a dog.
```

You should use f-strings and string formatting methods like `.title()` and `.lower()` to make your sentence look more natural.

It can be confusing to loop through two lists/arrays at the same time.  Instead, since every column of our array has the same length (5), we can loop through a range of that length and use each number as a row index.

In [0]:
for row in range(len(my_pets)):
  print(row)

Now, we can use this new "row" variable to index the columns at that row.  This is much simpler than trying to nest loops!

In [0]:
# complete the code below:
for row in range(len(my_pets)):
  print(f'{} is a {}')

Great! We're off to a great start, but this is really hard to read. I can't remember the order of our values.  This is where pandas can be really helpful. Let's convert our pet array into a dataframe.  Remember, when we're done it will look and act pretty similar to Excel or an R dataframe.

# **Pandas**
Pandas will be very familiar to people who have experience with R.  Pandas can be thought of as an extension to NumPy and offers the ability to add column and index labels, while retaining all the useful methods NumPy already offers.  Pandas also makes loading data from a variety of sources (csv, excel, webpages, spss, etc.) super simple, as we'll see a little later.

Similar to NumPy, Pandas is often given the nickname "pd" and can be imported like this: 

```
import pandas as pd
```

Let's load our `my_pets` array into pandas.  In case you took a break, let's re-import numpy and reload our array so you don't have to start from the beginning

In [0]:
# You can skip this if everything is still in memory
import numpy as np

my_pets = np.array([['Dog', 'Moose', 4, 4, 20], 
                    ['Dog', 'Dobby', 2, 3, 67],
                    ['Cat', 'Meow', 4, 7, 13],
                    ['Bird', 'Iago', 2, 9, 3],
                    ['Spider', 'Bob', 8, 1,1]])

In [0]:
import pandas as pd

df = pd.DataFrame(my_pets)
print(df)

Here we use our newly imported `pd` and call `DataFrame()` from the package to put our `my_pets` array into a neatly formatted Pandas dataframe. 

Note the `.` between `pd` and `DataFrame`.  This dot notation is important: it means that `DataFrame` comes from the pandas module.

This is already a much nicer layout.  While pandas makes arrays/dataframes a LOT nicer to look at and work with, remember that the underlying data are still numpy arrays.

In [0]:
print(type(df))             # entire dataframe
print(type(df[0]))          # first column
print(type(df[0].values))   # all the values

### **Columns and Series**
Above, you'll notice that the entire array is called a **dataframe** and the columns are called **Series**.

We can easily look at different parts of a dataframe, like the columns:

In [0]:
print(df.columns)

This is saying that our columns are currently just the numbers 0-4.  Let's add some column headers so we can keep our variables straight.

In [0]:
df.columns = ['Species', 'Name', 'Number_of_legs', 'Age', 'Weight']
print(df)

### **Setting the index**
Now, let's set the index to something a bit more useful:

In [0]:
df.set_index('Name', inplace=True)  # inplace=True is the same as df=df.set_index('Name')
print(df)

### **Referencing specific rows**
We can easily reference rows in pandas dataframes by using either `df.loc` or `df.iloc`.


#### Index VALUE


```
df.loc['value']
```

is telling python to grab the row in the dataframe where the index location is equal to a certain value.

In [0]:
# look at the first row based on index VALUE
print(df.loc['Moose'])

#### Index LOCATION


```
df.iloc[0] 
```

is telling python to grab the row in the dataframe at the indicated index location (in our case at index '0').

In [0]:
# look at the first row based on index LOCATION
print(df.iloc[0])
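
Both `.loc` and `.iloc` can return more than one row at a time. The sketch below uses a small made-up dataframe called `fruit` (not the workshop data) to show a list of index values with `.loc` and a slice of positions with `.iloc`.

```python
import pandas as pd

fruit = pd.DataFrame({'Color': ['red', 'yellow', 'green'],
                      'Price': [1.5, 0.25, 3.0]},
                     index=['apple', 'banana', 'melon'])

by_label = fruit.loc[['apple', 'melon']]   # rows picked by index VALUE
by_position = fruit.iloc[0:2]              # rows picked by index LOCATION

print(by_label)
print(by_position)
```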

### Your turn!

Use `df.loc` to print how many pounds Moose weighs.  Hint: you'll have to use both single and double quotes.

In [0]:
# How many pounds does Moose weigh?
print(f'Moose weighs {} pounds.')

### **Referencing specific columns**
Referencing columns/series is just as simple as referencing rows.

#### Column VALUE

In [0]:
# look at the first column based on header
print(df['Species'])

#### Column LOCATION
This is slightly more involved because we'll have to slice the dataframe. 

First, let's get the column we want

In [0]:
print(df.columns[0])

Now we can slice the dataframe based on the output of `df.columns[0]`

In [0]:
df[df.columns[0]]
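
As an alternative sketch, `.iloc` also accepts a row position and a column position together, so `:` for "all rows" plus a column number selects a column by location in one step. The `fruit` dataframe here is made up for the example.

```python
import pandas as pd

fruit = pd.DataFrame({'Color': ['red', 'yellow'], 'Price': [1.5, 0.25]})

first_by_name = fruit[fruit.columns[0]]   # look up the header, then select by it
first_by_position = fruit.iloc[:, 0]      # all rows, first column

print(first_by_position)
print(first_by_name.equals(first_by_position))
```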

### **Value types**
Sometimes you might have to explicitly define your variable types.  When we check below, we'll find that all of our data are strings.  Luckily, it's easy to change the datatype of entire columns.

In [0]:
print(type(df['Age'].values[0]))
print(df.Age.values)

In [0]:
# Let's convert some of the columns to float instead of string
df['Age'] = df['Age'].astype('float')
df['Number_of_legs'] = df['Number_of_legs'].astype('float')  # same spelling as the existing column, so no new column is created
df['Weight'] = df['Weight'].astype('float')

print(type(df['Age'].values[0]))
print(df.Age.values)
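
If you already know which columns should change, `astype` also accepts a dictionary that maps column names to types, which converts several columns in one call. A minimal sketch on a made-up dataframe called `animals`:

```python
import pandas as pd

# A made-up dataframe whose numbers were stored as text
animals = pd.DataFrame({'Name': ['Rex', 'Tweety'],
                        'Age': ['4', '9'],
                        'Weight': ['20', '3']})

# Map each column name to the type it should become
animals = animals.astype({'Age': 'float', 'Weight': 'float'})

print(animals.dtypes)
```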

### Your turn!
Create a 'floatify' function that will loop over all the columns and convert the ones with numeric values to floats.  We're going to have to use some error handling for this function.  We'll come back to this idea in the future, but for now here are the basics:



```
try:
  # code to try
except:
  # if there's an error above it will be piped to this condition instead of crashing
```

We can be more specific as well and only accept certain types of errors.  An IOError is an input/output error. 

YOU SHOULD ALWAYS HAVE A SPECIFIC EXCEPTION AFTER YOUR EXCEPT!

If you use the generic `except:` without naming an exception, it will swallow every error raised in the `try` block, including ones you didn't expect, which hides real bugs in your code. This is bad coding practice, so be specific with your except blocks.



```
try:
  # code to try
except IOError:
  # only execute this if the code in the try statement failed
  # specifically because of an IOError
```

If you want to catch more than one kind of error with your `try:... except:` blocks, you can use parentheses to contain the exceptions you want to handle:


```
try:
  # code to try
except (IOError, ValueError):
  # only execute this if the code in the try statement failed
  # specifically because of an IOError OR a ValueError
```
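
To make the templates above concrete, here is a small runnable sketch. The helper name `to_float_or_none` is made up for this example; it relies on the fact that Python's built-in `float()` raises a `ValueError` when given text that isn't a number.

```python
def to_float_or_none(text):
  """Return text converted to a float, or None if it isn't numeric."""
  try:
    return float(text)
  except ValueError:
    # float() raises ValueError for strings like 'Dog'
    return None

print(to_float_or_none('3.5'))
print(to_float_or_none('Dog'))
```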

If you're not sure what Exception to include after your `except`, there are a few things you could try.

First, you could consult this website: [here](https://www.tutorialsteacher.com/python/error-types-in-python), which goes into detail on a number of exceptions and their use cases.

You could also run your code without the `try...except` and see what error it gives you.  If that's the error you were trying to catch, you can then name it in your `try...except` block.

In [0]:
def floatify(dataframe):
  """
  This is called a doc string.
  It's a place for you to define what your function does
  And what your input and output will look like
  """
  for series in dataframe.columns:
    try:
      # insert your code here
      
    except ValueError:
      print(f'Could not convert {series} to float')
  return dataframe

Now that we know the basics of NumPy and Pandas, let's load in some real data.  But how do we do that?  First, we'll have to learn how to navigate through the computer to find the file we want to load.

# **System Navigation**
Python has a built-in package that allows the user to navigate through their operating system. To access these functions, just type `import os`. 

**This will make much more sense if you are using Anaconda on your local computer instead of using Google Colab**.  

If you still need to download Anaconda you can do that [here](https://www.anaconda.com/distribution/).

When we use Google Colab, we're actually on one of Google's computers, which makes it difficult to access local files, especially in a group setting like this, unless they're stored in your Google Drive.  We're going to go through the syntax, but we **strongly** encourage you to play with this on your own computer.

A final note: a "directory" refers to a "folder"

### **System syntax**
A quick note about different operating systems:

For most of you, your work computers will be using Windows 10, which formats its file paths like this

```
"C:\Users\username\"
```

As we learned last week, `\` is a special character which "escapes" the following character.  There are two ways to get around this:

You can double the `\`
```
"C:\\Users\\username\\"
```
or you can precede your quotes with `r`. This tells python to use the raw text and treat it exactly as it's typed.  One catch: even a raw string can't end in a single `\`, so leave off the trailing backslash
```
r"C:\Users\username"
```

Some of you might be using Macs or Linux computers, both of which use `/` instead of `\`.  Since `/` is not a special character, there is no need for special treatment.
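
One way to sidestep the slash question entirely is to let Python build the path for you. The sketch below uses the standard library's `pathlib`; it is an optional alternative, not something the rest of this notebook depends on. The folder and file names are taken from later in this notebook, and `username` is the same placeholder as above.

```python
from pathlib import Path, PureWindowsPath

# Path uses the separator of whatever operating system runs the code
data_file = Path('nch_python_workshop') / 'Data' / 'example1.xlsx'
print(data_file)

# PureWindowsPath shows how the same pieces look on Windows, on any system
windows_style = PureWindowsPath('C:/Users') / 'username' / 'Data'
print(windows_style)
```

The `/` operator here joins path pieces; it does not divide anything.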

### **os commands**
Below, we'll go through some commands that will be useful for navigating through your folders.

In [0]:
# First, we need to import os
import os

*Notice we didn't give `os` a "nickname" when we imported it like we did for NumPy and Pandas. `os` is only a few letters, so there was no need to assign it a short name to reference it by. Conversely, there was no reason we had to assign pd and np to Pandas and NumPy when importing them; it just makes typing them easier.*

### **Create data**
In order for us to learn how these commands work, let's download some common data.  The following command will make a temporary copy of the GitHub repository we're working with in Google Colab. When you refresh the page, you'll have to run this line again to get it back into your workspace.

In [0]:
!git clone https://github.com/connorgrannis/nch_python_workshop.git

Did you notice the `!` in front of `git`? Remember this means that this isn't a Python command, and instead tells Google Colab to run it as a command-line (shell) command.

#### **Get current working directory**

In [0]:
print(os.getcwd())

#### **Display the contents of a directory**

In [0]:
print(os.listdir())

In [0]:
print(os.listdir('nch_python_workshop'))

#### **Change directories**

In [0]:
os.chdir('nch_python_workshop/Data')
print(os.getcwd())

In [0]:
for file in os.listdir():
  print(file)

Together, these commands let you find and move to the correct directory before you read, manipulate, or create any files.
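
A common next step is to keep only the files of one type. Here is a hedged sketch that combines `os.listdir` with the string method `.endswith()`. So that it runs anywhere, it first creates a throwaway folder with a few empty files using the standard library's `tempfile`; the file names are made up.

```python
import os
import tempfile

# Make a throwaway folder with a few empty files so the example is self-contained
with tempfile.TemporaryDirectory() as folder:
  for name in ['a.csv', 'b.xlsx', 'c.csv']:
    open(os.path.join(folder, name), 'w').close()

  # Keep only the files whose names end in .csv
  csv_files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))

print(csv_files)
```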

# **Reading in Files**
A huge benefit to working with Pandas is its ability to load in different types of data files.  Pandas can easily convert .csv and .xlsx files into a dataframe using the `pd.read_csv` and `pd.read_excel` functions.

### **Loading Excel files**

In [0]:
import pandas as pd
excel = pd.read_excel('example1.xlsx')
print(excel)

### **Be careful!**
By default pandas assumes that your file has column headers. If that's not the case, make sure to add `header=None` as an argument to the `read_excel` function.

In [0]:
excel = pd.read_excel('example1.xlsx', header=None)
excel

Now you can add column and index labels just like before

In [0]:
excel.columns = ['Species', 'Name', 'Number_of_legs', 'Age', 'Weight']
excel.set_index('Species', inplace=True)
excel

### **Loading csv files**

In [0]:
csv = pd.read_csv('example2.csv')
csv.head()
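
The `header=None` option works for csv files too, and `read_csv` can assign column labels in the same call through its `names` argument. The sketch below reads from a short made-up string (wrapped in `io.StringIO` so it behaves like a file) rather than from the workshop's data files.

```python
import io
import pandas as pd

# Stand-in for a csv file on disk that has no header row (made up for this sketch)
raw_text = 'Dog,Moose,4\nCat,Meow,7\n'

pets = pd.read_csv(io.StringIO(raw_text),
                   header=None,                       # the first line is data, not labels
                   names=['Species', 'Name', 'Age'])  # supply our own column labels
print(pets)
```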

### **Loading files from SPSS**

In [0]:
!pip install savReaderWriter

from savReaderWriter import SavReader

with SavReader('example3.sav', ioUtf8 = True) as reader:
      spss = pd.DataFrame(reader.all(), columns = reader.header)

In [0]:
print(spss.head())

There's a lot of stuff going on here. Let's break it down:
- After we installed `savReaderWriter` we import it in a way we haven't seen yet: `from savReaderWriter import SavReader`. This means that from the main package `savReaderWriter` we only want to import `SavReader`. This is nice when we know we only want to use one or two things from a package: that way we don't have to type out the full `savReaderWriter.SavReader()` to use it.

- SPSS files are encoded using utf-8, so we're going to set that to true.  Let's see what that looks like:

In [0]:
temp_df = SavReader('example3.sav', ioUtf8 = True)  # SavReader was imported directly, so no package prefix is needed
print(temp_df.head())
print(type(temp_df))
print(temp_df.shape)
print(type(temp_df[0]))

- This is the data that goes into the dataframe

In [0]:
df = pd.DataFrame(temp_df.all())
print(df.head())

- Setting the column headers

In [0]:
df.columns = temp_df.header
print(df.columns)

We can make this into a function that might come in handy in the future:

In [0]:
def open_spss(filename):
  with SavReader(filename, ioUtf8=True) as reader:  # SavReader was imported directly above
    return pd.DataFrame(reader.all(), columns=reader.header)

df = open_spss('example3.sav')
print(df.head())
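
This notebook has now used three styles of import: the whole package, the whole package under a nickname, and a single name pulled out of a package. As a side-by-side sketch, here are all three applied to the built-in `os` package. The nickname `my_os` is made up for illustration.

```python
import os                   # 1. import the whole package; use it as os.getcwd()
import os as my_os          # 2. same package under a nickname; use it as my_os.getcwd()
from os import getcwd       # 3. import one name only; use it as getcwd()

# All three refer to the same function, so they give the same answer
print(os.getcwd() == my_os.getcwd() == getcwd())
```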

# **Recap: What have we learned?**
- Downloading and importing packages
  - The three ways of importing packages
- Navigating directories
  - How to display directory contents
- Numpy
  - Numpy arrays and functions
  - Numpy indexing and slicing

- Pandas
  - How to load data into Pandas
  - Pandas indexing and slicing
  - How to drop rows and columns
  - How to convert Pandas Series data types

# **Practice 1: Numpy**
1. Create a numpy array of shape 2x4
2. Transpose this array
3. Create a 1D array of length 9
4. Use `np.reshape` to convert this to a 3x3 array
5. Add this array to itself (numerically)
6. Add this array to itself (concatenate)

Advanced:
1. Create a symmetric 10x10 array of random values using `np.array` (reference 'Mocker' from last week on how to use random values)
2. Set the diagonal to 1
3. Return the upper or lower triangular as a 1D array

In [0]:
# If you want to, you can write practice 1 here. Or you can try writing it in an IDE of your choice. (Spyder is one that comes with Anaconda)

# **Practice 2: Pandas**
1. Convert one of the arrays from above into a Pandas dataframe and add reasonable index and column labels
2. Load the titanic "train" dataset into a Pandas dataframe from [kaggle](https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv).
3. Drop the "Name" column

Advanced:
1. Using only Pandas, plot a scatter plot of `Age` and `Fare`. Here's a link to the function you'll need: [Pandas.DataFrame.plot](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.plot.html)
2. Attach a new row of data to the Titanic set
3. Attach a new column of data to the titanic set
4. Use the pandas `groupby` method to display the mean age of the survivors and non-survivors

In [0]:
# If you want to, you can write practice 2 here. Or you can try writing it in an IDE of your choice. (Spyder is one that comes with Anaconda)