## Notebook Topic: pandas is the way!

<ins>Learning Objectives</ins>

1. To gain familiarity and comfort with the pandas library

While **numpy** was the favored package for scientific computing for many years (and is still widely used), **pandas** has become very popular for data analysis. We will learn the basics in this notebook and build on them in later notebooks. See the "Resources" folder on our class website for a **pandas** cheatsheet (as well as a Jupyter and Python basics cheatsheet).

**Section I: Series**

In **pandas**, there are two main data structures (specific only to this library).  One of these data structures is called *Series*.  

A Series is a one-dimensional array-like object containing a sequence of values of the *same* type and an associated array of data labels called its index.

<span style="color:purple">Check out the code chunk below to better understand what this definition means.</span>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

M = pd.Series([1, 2, 3, 4])

print('These are the values:\n', M.array, '\nand these are the index labels:\n', M.index)
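
As a small additional sketch of the definition above (the variable names `s` and `mixed` are made up for illustration), the `dtype` attribute reports the single type shared by all values, and a Series created without an explicit index receives default integer labels:

```python
import pandas as pd

# A Series built without an explicit index gets default integer labels.
s = pd.Series([1, 2, 3, 4])

print(s.dtype)         # the one dtype shared by every value
print(list(s.index))   # the default index labels

# Mixing integers with a float converts every value to one common dtype.
mixed = pd.Series([1, 2.5, 3])
print(mixed.dtype)
```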

Series are used because of their indexing feature.  You can select individual data points based on their index.  For example, let's add some unique indexing names for M:

In [20]:
M.index = ["ABBA", "NSYNC", "FLO RIDA", "T SWIFT"]

Now, we can grab information from M based on its index.

In [None]:
M['ABBA']

In [None]:
M[['ABBA', 'FLO RIDA']]

<span style="color:purple">
What do you notice about how the two lines of code differ, and how their outputs differ?
</span>

Answer the question here:


Rather than just grabbing information from a data set, you can build new Series or aggregate two Series based on the index information.
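
One possible way to build a labeled Series, sketched here with made-up numbers (the dictionary `plays` and its values are hypothetical, not class data), is to start from a dictionary so that the keys become the index:

```python
import pandas as pd

# Hypothetical values, chosen only for illustration.
plays = {"ABBA": 10, "NSYNC": 20, "T SWIFT": 30}

# The dictionary keys become the index labels.
s = pd.Series(plays)
print(s)

# Passing index= picks out labels and sets their order.
subset = pd.Series(plays, index=["T SWIFT", "ABBA"])
print(subset)
```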

In [None]:
artists = ["Ozzy Osbourne", "NSYNC", "FLO RIDA", "T SWIFT"]

new = pd.Series(M.array, index = artists)

new + M

<span style="color:purple">What do you notice about the output?  What do you think it means?</span>

Answer here:

**Section II: DataFrame**

This may be the most important object type you learn about this semester! While numpy is great, its strength is working with numeric data. A pandas DataFrame lets you analyze data sets with mixed data types.
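
To make "mixed data types" concrete, here is a minimal sketch using a hypothetical toy table (the name `toy` and its columns are invented for this example). The `dtypes` attribute lists the type of each column separately:

```python
import pandas as pd

# Hypothetical toy data: a text column, an integer column, and a float column.
toy = pd.DataFrame({
    "name": ["a", "b"],
    "visits": [1, 2],
    "score": [0.5, 1.5],
})

# Each column keeps its own dtype.
print(toy.dtypes)
```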

In [None]:
# Here is one way to create a dataframe

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)

frame.head()

<span style="color:purple">What do you think *.head() does based on the output?</span>

In [None]:
frame.shape

<span style="color:purple">What do you think *.shape does based on the output?</span>

DataFrames share many features with Series. For example, we can extract information from a particular column using a column name.

In [None]:
frame.year
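
A caution about the attribute style, sketched with a smaller made-up frame: it only works when the column name does not clash with an existing DataFrame attribute. A column named `pop` clashes with the `DataFrame.pop` method, so bracket access is the safer choice there:

```python
import pandas as pd

# A smaller, made-up frame with the same column names as above.
frame = pd.DataFrame({"year": [2000, 2001], "pop": [1.5, 1.7]})

# Bracket access always returns the column.
print(frame["pop"])

# Attribute access finds the DataFrame.pop method, not the column.
print(callable(frame.pop))

# For a non-clashing name, both styles return the same column.
print(frame["year"].equals(frame.year))
```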

**Bonus points**. Gather, in a new code chunk, the population information for the year 2000.  You can use any method you feel comfortable with right now.  You may or may not need to review the first four notebooks.

Let's check out a pre-existing data set (libraries such as seaborn provide some, and it is sometimes more convenient to use one of them than to build your own).

In [None]:
iris = sns.load_dataset('iris')
iris.___() #fill in the blank with the function we used to examine the first 5 rows of the data

In this text box: <span style="color:purple">describe the data.</span>

DataFrames are relatively easy to subset/slice, as we learned in earlier notebooks.

In [72]:
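indx = iris.species == 'setosa'   # boolean mask: True for the setosa rows
setosa = iris[indx].copy()        # .copy() lets us add columns later without a SettingWithCopyWarning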
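
The same subsetting pattern can be sketched on a tiny made-up table (the `pets` frame and its values are hypothetical), which makes each intermediate step easy to inspect:

```python
import pandas as pd

# Hypothetical toy data standing in for a larger data set.
pets = pd.DataFrame({
    "kind": ["cat", "dog", "cat"],
    "weight": [4.0, 20.0, 5.5],
})

# The comparison produces a boolean Series with one True/False per row.
is_cat = pets["kind"] == "cat"

# Indexing with the mask keeps only the True rows; .copy() makes the
# result independent of the original frame.
cats = pets[is_cat].copy()

# .loc accepts the mask and a column label together.
cat_weights = pets.loc[is_cat, "weight"]

print(cats)
print(cat_weights)
```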

We can add or remove columns!

In [None]:
# create random data with one value per row of setosa
fake_coln = np.random.normal(loc = 0, scale = 1, size = (___,1)) #fill in the blank with the number of rows of the setosa subset (not the full iris dataset)
# create extra column
setosa["fake"] = fake_coln
setosa.head()

In [None]:
del setosa['fake']

setosa.columns
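
As an alternative sketch to assigning a column and then using `del`, the `assign` and `drop` methods return new DataFrames and leave the original untouched. The frame `df` and the column name `noise` below are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# A seeded generator so the random column is reproducible.
rng = np.random.default_rng(0)

# assign returns a new DataFrame with the extra column; df is unchanged.
with_noise = df.assign(noise=rng.normal(loc=0, scale=1, size=len(df)))

# drop(columns=...) also returns a new DataFrame without that column.
without_noise = with_noise.drop(columns=["noise"])

print(with_noise.columns)
print(without_noise.columns)
```

Which style to use is largely a matter of taste: in-place assignment is shorter, while `assign` and `drop` make it harder to modify a data set by accident.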

## Conclusion

**Note:** Chapter 5 goes into more detail about many more of the methods that work on Series and DataFrames.

<span style="color:purple">
    
Skim that chapter. Name at least one thing about DataFrames that you learned there but did not learn here.
    
</span>