# Python Basics 2

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of _tabular_ data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

Guidance for this lesson: https://github.com/rabernat/research_computing_2018/blob/master/content/Lectures/pandas.ipynb

Mad Pandas help: https://chrisalbon.com

Guidance for Git: http://swcarpentry.github.io/git-novice/

Documentation for visualizations: https://hvplot.pyviz.org/user_guide/index.html


In [None]:
import numpy as np
import pandas as pd
import hvplot.pandas
from matplotlib import pyplot as plt
%matplotlib inline

## Basic Math on Arrays

The goal of the next few questions is for you to demonstrate the ability to do basic math on arrays.
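As a minimal sketch of what "basic math on arrays" can look like, the example below applies arithmetic to every element at once and takes means over the whole array, per column, and per row. The variable names (`doubled`, `shifted`, `column_means`, `row_means`) are illustrative and not part of the exercise.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Arithmetic operators act on every element of the array.
doubled = matrix * 2
shifted = matrix + 10

# Without an axis argument, the mean covers all nine values.
overall_mean = matrix.mean()

# axis=0 collapses the rows, giving one mean per column;
# axis=1 collapses the columns, giving one mean per row.
column_means = matrix.mean(axis=0)
row_means = matrix.mean(axis=1)

print(overall_mean)
print(column_means)
print(row_means)
```

`matrix.mean()` and `np.mean(matrix)` compute the same thing; the method form is just shorter.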

In [None]:
#Question 4. Add a cell and use it to create a matrix:

matrix = np.array([[1,2,3],
                 [4,5,6],
                 [7,8,9]])

#What command would you use to calculate the mean of this array using numpy? 

np.mean(matrix)

## Pandas Data Structures: Series

A Series represents a one-dimensional array of data. The main difference between a Series and numpy array is that a Series has an index. The index contains the labels that we use to access the data.
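To illustrate the role of the index, here is a small sketch that reads the same value from a Series by label with `.loc` and by position with `.iloc`. The values mirror the series built in Question 7; the names `by_label`, `by_position`, and `subset` are illustrative.

```python
import pandas as pd

ages = pd.Series([36, 37, 2.7], index=['Aida', 'Josh', 'Jordan'])

# Label-based access uses the index values.
by_label = ages.loc['Josh']

# Position-based access ignores the labels and counts from zero.
by_position = ages.iloc[1]

# A list of labels selects several entries and keeps their labels.
subset = ages.loc[['Aida', 'Jordan']]

print(by_label, by_position)
print(subset)
```

A plain numpy array only supports the positional form, which is the practical difference the index makes.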

In [None]:
#Question 7.
#1. Add a code cell below that creates the following series:

names = ['Aida', 'Josh', 'Jordan']
values = [36, 37, 2.7]
ages = pd.Series(values, index=names)
ages

In [None]:
#2. In the next cell create a bar graph from the series:

ages.plot(kind='bar')

## Pandas Data Structures: DataFrame

There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the DataFrame. A DataFrame is basically a bunch of Series that share the same index. It's a lot like a table in a spreadsheet.
Below we create a DataFrame.
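One way to see the "Series that share the same index" idea is to build a DataFrame from separate Series whose labels are in different orders. This is a sketch using the same people and values as Question 8; the name `df_aligned` is illustrative. Pandas matches rows by label rather than by position, and fills in `NaN` where a Series has no entry for a label.

```python
import pandas as pd

age = pd.Series([36, 37, 1.7], index=['Ryan', 'Chiara', 'Johnny'])

# Same labels, deliberately listed in a different order.
height = pd.Series([90, 180, 155], index=['Johnny', 'Ryan', 'Chiara'])

# No entry for Chiara, so her weight will be missing (NaN).
weight = pd.Series([78, 11.3], index=['Ryan', 'Johnny'])

# Each Series becomes a column; rows are matched on the index labels.
# The row order of the result may differ from the order used above.
df_aligned = pd.DataFrame({'age': age, 'height': height, 'weight': weight})

print(df_aligned)
print(df_aligned.loc['Ryan', 'height'])
```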

In [None]:
#Question 8. First we create a dictionary:

data = {'age':[36,37,1.7],
       'height':[180,155,90],
       'weight':[78,np.nan,11.3]}
df = pd.DataFrame(data,index = ['Ryan','Chiara','Johnny'])
df

In [None]:
#Question 9. If we make a calculation using columns from the DataFrame, it will keep the same index:

df.weight/df.height

#What is Ryan's calculated density to three decimal places?

In [None]:
#Question 10. Create a new column using a boolean series:

df['is_adult'] = df.age > 18
df

#Which of our participants returns a "False" in our new DataFrame?
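
A boolean Series like the one above can also be used as a filter. The sketch below is a side example rather than part of the question: it rebuilds only the `age` column in a separate DataFrame (the names `people`, `adults`, and `n_adults` are illustrative), keeps the rows where the condition is true, and counts them.

```python
import pandas as pd

people = pd.DataFrame({'age': [36, 37, 1.7]},
                      index=['Ryan', 'Chiara', 'Johnny'])

# Comparing a column to a number gives one True/False per row.
is_adult = people['age'] > 18

# Indexing with the boolean Series keeps only the True rows.
adults = people[is_adult]

# True counts as 1 when summed, so this is the number of adults.
n_adults = int(is_adult.sum())

print(adults)
print(n_adults)
```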

In [None]:
#Modifying Values:
#We often want to modify values in a DataFrame based on some rule. To modify values, we need to use .loc or .iloc.
#Question 11. Here is an example:

df.loc['Johnny','height'] = 95
df.loc['Ryan','weight'] += 1
df

In [None]:
#If we use the iloc command, what syntax would allow us to similarly change our entry for Johnny's height?

df.iloc[2,1] = 100

#If we use the iloc command, what code would allow us to similarly add 1 to Ryan's weight?

df.iloc[0,2] += 1

df
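
To check which integer positions correspond to a given label, one option is `Index.get_loc`. The sketch below works on a separate copy of the data (named `demo` here, an illustrative name) so it does not disturb `df`, and shows that `.loc` with labels and `.iloc` with the looked-up positions address the same cell.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'age': [36, 37, 1.7],
                     'height': [180, 155, 90],
                     'weight': [78, np.nan, 11.3]},
                    index=['Ryan', 'Chiara', 'Johnny'])

# Look up the integer positions that match the labels.
row = demo.index.get_loc('Johnny')
col = demo.columns.get_loc('height')
print(row, col)

# Write by label, then read the same cell back by position.
demo.loc['Johnny', 'height'] = 95
print(demo.iloc[row, col])

# Write by position, then read the same cell back by label.
demo.iloc[row, col] = 100
print(demo.loc['Johnny', 'height'])
```

Label-based `.loc` tends to be easier to read and keeps working if columns are reordered, while `.iloc` depends on the current row and column order.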

## Plotting

DataFrames have all kinds of useful plotting built in. Review the plotting documentation for hvplot (https://hvplot.pyviz.org/user_guide/Introduction.html) and then plot some data from our DataFrame.

In [None]:
#Question 12. Add a new code box and build your first plot:

df.hvplot(x = 'age',y = 'height', 
          kind = 'scatter', grid = True)

In [None]:
#How does it look? 
#I think in this case it might be helpful to specify the limits on our x and y axes.
#Let's modify our code by adding those limits:

df.hvplot(x = 'age',y = 'height', 
          kind = 'scatter', grid = True,
         xlim = (0,40), ylim = (0,200))


In [None]:
#Question 13. Now read the documentation for hvplot and add the variable 'weight' to your plot. 

df.hvplot(x = 'age',y = ['height','weight'],
          kind = 'scatter', grid = True,
         xlim = (0,40), ylim = (0,200))


In [None]:
#Question 14. Read the hvplot documentation and create this bar graph and upload your results:

df.hvplot(x = 'index',y = ['age','height','is_adult','weight'], c = ['blue','darkorange','red','green'],
          kind = 'bar', grid = False,
         xlim = (0,40), ylim = (0,200))