# Basic `pandas`

In this notebook we will learn about `pandas`, a vital data management package in Python.

By the end of this notebook you will know about:
- `pandas` `Series` and `DataFrame` objects,
- Indexing `pandas` objects,
- Reading files with `pandas`,
- Saving files with `pandas`, and
- Some helpful `pandas` functionality.

## `pandas`

`pandas` is one of the most popular data handling packages in `python`. We will cover the minimum you need to know about the package for the boot camp in this notebook.

Let's start by importing the package. Just like in the previous notebook, this will check whether or not you have `pandas` installed on your machine. If you are using Anaconda, <a href="https://www.anaconda.com/">https://www.anaconda.com/</a>, `pandas` should be installed already. If not, check out the documentation for installation instructions, <a href="https://pandas.pydata.org/docs/getting_started/install.html">https://pandas.pydata.org/docs/getting_started/install.html</a>.

In [None]:
## it is standard to import pandas as pd
import pandas as pd

In [None]:
## let's check what version of pandas you have
## when I wrote this I had version 1.3.5
## yours may be different
print(pd.__version__)

If you had a version of `pandas` installed, both of those code chunks should have executed without error. If not, you will need to install it onto your machine because we will be using it heavily in the boot camp. If you are unsure how to install a Python package in general, check our Python package installation guide, <a href="https://www.erdosinstitute.org/data-science">https://www.erdosinstitute.org/data-science</a>.

##### Be sure you can run both of the above code chunks before continuing with this notebook. Again, it should be fine if your package version is slightly different from mine.

### `Series` and `DataFrame`s

`pandas` has two main data structures: 
1. `Series` objects, <a href="https://pandas.pydata.org/docs/reference/api/pandas.Series.html">https://pandas.pydata.org/docs/reference/api/pandas.Series.html</a> and 
2. `DataFrame` objects, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html</a>. 

Let's explore them below.

In [None]:
## We can turn a list into a series
## with pd.Series()
print("Below is a list.")
print([0,1,2,3], type([0,1,2,3]))
print()
print("Below is a pandas Series object.")
print(pd.Series([0,1,2,3]), type(pd.Series([0,1,2,3])))

The second thing we printed was a `Series` object. Note the two columns of numbers. The first column is the index of the object, the second column contains the values of the object. We can access those two separately like below.

In [None]:
## The index
pd.Series([0,1,2,3]).index

In [None]:
## The values
pd.Series([0,1,2,3]).values

In [None]:
## You practice
## Take the list labeled a and 
## turn it into a Series named b
a = [5,2,3,6,'a','b','e',True,False]





In [None]:
## We can select entries from a Series
## using the index
b[3]

In [None]:
## Note that we don't have to have our index be a list of numbers
c = pd.Series([1,2,3,4], index=['a','b','c','d'])

print(c)

In [None]:
## and we can access by index in the same way
c["a"]

Now let's check out a `DataFrame`. `DataFrame`s are essentially a collection of `Series` with a common index. We can also think of them as a table of values with column and row labels.
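
To make the "collection of `Series` with a common index" idea concrete, here is a minimal sketch that builds a `DataFrame` from two `Series` sharing the same index. The names `heights`, `weights`, and `people`, along with the data, are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical example data)
heights = pd.Series([150, 160, 170], index=["x", "y", "z"])
weights = pd.Series([50, 60, 70], index=["x", "y", "z"])

# Each dictionary key becomes a column label,
# and the shared index becomes the row labels
people = pd.DataFrame({"height": heights, "weight": weights})
print(people)

# Pulling a column back out gives a Series with that same index
print(people["height"].index.equals(people.index))
```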

In [None]:
## We can make a DataFrame using a dictionary
## the dictionary keys are the column labels
## the dictionary values are columns
df = pd.DataFrame({'one':[3,4,5,2,4,5], 
                    'two':['a','b','e','h','l','p'],
                    'third column':[7,7,7,7,7,7]})

## NOTE that this is not the only way to make 
## a dataframe!

print(df)

This is a `DataFrame`. The unlabeled column is the index, and the labeled columns are `Series` objects themselves. We can access them in the following way.

In [None]:
## If your column's name doesn't violate a couple of format rules
## you can use
## df.column_name
print(df.one) 
print()
print("Note that each column is a Series object.")
print(type(df.one))

In [None]:
## However, if our column name has spaces or certain characters in it
## (like . , ! ? "" and so on)
## we can't use .column_name; the line below raises a SyntaxError
df.third column

In [None]:
## So we have to use df["column name"] instead
# or df['column_name']
print(df['third column']) 
print()
print(type(df['third column']))

In [None]:
## Just like with a Series we can use .index
## to get the index
df.index

In [None]:
## .values returns a 2-D numpy ndarray with our columns
df.values

In [None]:
## You code
## Make a data frame, call it my_df 
## Make the first column labeled 'first' from a
## Make the second column labeled 'second' from b
## see what happens when you add , index=range(10,10+len(a)) 
## after the dictionary
a = [4,5,3,4,5,6,0]
b = ['a','c','d','g','l','m','p']





#### Locating `DataFrame` Entries With the Index

Locating entries in a `DataFrame` is slightly more complicated than what we had to do for `Series` objects.

In [None]:
## You code
## try to index the dataframe df like you would
## a list or Series to find the entry corresponding
## to index 1


## What happens?

In order to get particular `DataFrame` entries you have to use one of two methods, `.loc[]` or `.iloc[]`.

In [None]:
df

In [None]:
## for example we can get the row of df labeled 1 like so with loc
df.loc[1]

In [None]:
## or like so with iloc
df.iloc[1]

The difference between `.loc` and `.iloc` is that `.iloc` is restricted to integer position based indexing, while `.loc` selects by index label and is more versatile.

This means that if your index is a list of strings you cannot pass those string labels to `.iloc`. For example try to run the following.

In [None]:
## Recall what c is
c

In [None]:
## now try to get the "a" entry using .iloc (this raises a TypeError)
c.iloc["a"]

In [None]:
## instead we'd have to notice that the "a" entry is
## the 0 entry if the index were a normal integer based index
c.iloc[0]

In [None]:
## However, we can just insert "a" into .loc
c.loc["a"]
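
The label versus position distinction also matters when the index is made of integers that do not start at 0. The sketch below uses a made-up `Series` named `s` to show this; it is only an illustration, not part of the boot camp data.

```python
import pandas as pd

# Hypothetical Series whose integer labels do not start at 0
s = pd.Series(["first", "second", "third"], index=[10, 20, 30])

# .loc looks up the label 10
print(s.loc[10])

# .iloc looks up position 0, which is the same entry here
print(s.iloc[0])

# Position 1 holds the entry labeled 20
print(s.iloc[1] == s.loc[20])
```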

`.loc` can also be used to perform boolean indexing with a boolean `Series` (e.g., find all the rows where the first column is between $2$ and $4$).

In [None]:
df

In [None]:
## give me all the rows where the "one" column is between 2 and 4
df.loc[(df.one < 4) & (df.one > 2)]

In [None]:
## we can even go one step further and find a specific column
df.loc[(df.one < 4) & (df.one > 2),'third column']

In [None]:
## or a subset of many columns
## in this case you enter the subset as a list of column name strings
df.loc[(df.one < 4) & (df.one > 2),['two','third column']]
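
It can help to look at the condition on its own before handing it to `.loc`. The sketch below rebuilds the same data under the assumed name `df_demo` (so that it runs on its own) and stores the condition in a variable called `mask`.

```python
import pandas as pd

# Same data as the df used above, rebuilt so this block runs on its own
df_demo = pd.DataFrame({'one':[3,4,5,2,4,5],
                        'two':['a','b','e','h','l','p'],
                        'third column':[7,7,7,7,7,7]})

# The condition evaluates to a boolean Series, one True/False per row
mask = (df_demo.one < 4) & (df_demo.one > 2)
print(mask)

# Passing the mask to .loc keeps only the rows marked True
print(df_demo.loc[mask])
```

Because both comparisons are strict, only rows where `one` equals 3 are kept here.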

### Reading in a `.csv` file with `pandas`

You can also read in common data file types with `pandas`. Let's load the following.

In [None]:
## You can read in a csv with
## pd.read_csv("filename.csv")
## https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html

## We'll work with the following dataframe in the next section
jr_shots = pd.read_csv("JR_Smith_Shots_2015_16.csv")

In [None]:
jr_shots

In [None]:
## You code
## read in the file "beer.csv" using pandas
## call it beer_df


## you'll practice with this dataframe in the next section

There are other useful `read` functions like `read_table` and `read_json`. You will see those in the `pandas` practice problems.
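
As a quick preview, here is a minimal sketch of both functions. To keep it self-contained it reads from in-memory text with `io.StringIO` rather than from files on disk; the text contents and variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a file on disk
tsv_text = "name\tscore\nann\t1\nbob\t2\n"
# read_table splits on tabs by default
table_df = pd.read_table(io.StringIO(tsv_text))

# Hypothetical JSON records standing in for a .json file
json_text = '[{"name": "ann", "score": 1}, {"name": "bob", "score": 2}]'
json_df = pd.read_json(io.StringIO(json_text))

print(table_df)
print(json_df)
```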

### Helpful `pandas` functions

`pandas` offers some really nice built in functions to help you explore any data set you're dealing with. Let's explore them below.

In [None]:
## df.head(n) lets you inspect the first n rows of the dataframe
## n defaults to 5
## https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html
jr_shots.head()

In [None]:
## You code 
## investigate what .tail(n) does 
## using the beer_df
## https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html
## Hint: n must be an integer



In [None]:
## You code
## investigate what .sample(n) does
## using the beer_df
## https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html
## Hint: n must be an integer



In [None]:
## You code
## what happens when you add the argument
## random_state=440 in .sample()?
## rerun the code chunk multiple times with or without the random_state
## try different numbers for the random_state




In [None]:
## df.info() tells you useful information about your
## dataframe, like the column names,
## the number of rows,
## the number of non-empty entries for each column,
## the data type of each column
## and how much memory the dataframe uses
## https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html
jr_shots.info()

In [None]:
## df.describe()
## provides summary statistics for each numeric column,
## that is, any column that consists of integers or floats
## will have its count, mean, standard deviation, minimum,
## first quartile, median, third quartile, and maximum provided
## https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html
jr_shots.describe()
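
To see exactly which statistics `.describe()` reports, here is a small sketch on a made-up one-column `DataFrame` called `demo`.

```python
import pandas as pd

# Hypothetical numeric column
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5]})

summary = demo.describe()
print(summary)

# The row labels of the summary name each statistic
print(list(summary.index))
```

The output of `.describe()` is itself a `DataFrame`, so a single statistic can be pulled out with `.loc`, for example `summary.loc["mean", "x"]`.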

In [None]:
## You code
## Try out .mean() and .max() and .min()
## on beer_df
## Compare the results to .describe()'s output

## .mean() - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mean.html
## .max() - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.max.html
## .min() - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.min.html





In [None]:
## What about variables that aren't numeric? Like
## classes
## You can call .value_counts() on a column for those
## https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html
jr_shots.SHOT_MADE_FLAG.value_counts()

The above tells us that 497 observations have a `SHOT_MADE_FLAG` of 0 and 352 have a `SHOT_MADE_FLAG` of 1.
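
The result of `.value_counts()` is a `Series` whose index holds the distinct values, so individual counts can be looked up by label. A minimal sketch on made-up data (the name `flags` is an assumption for illustration):

```python
import pandas as pd

# Hypothetical column of 0/1 flags
flags = pd.Series([0, 1, 0, 0, 1])

counts = flags.value_counts()
print(counts)

# Look up the count for a single value by its label
print(counts.loc[0])
```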

In [None]:
## You code
## How many of each "Beer_Type" are there in beer_df?


## What happens when you input the argument normalize=True into value_counts()?



One task that we will do quite often in the boot camp is split one data set into two non-overlapping data sets. Let's see how we can do that here.

In [None]:
## Let's first randomly sample 100 observations from jr_shots
## Note the .copy()
## this creates a deep copy of the DataFrame 
## instead of a shallow copy (see notebook 5. Shallow and Deep Copies)
jr_1 = jr_shots.sample(100).copy()

In [None]:
## To create jr_2 we'll just drop the indices of jr_1 from
## jr_shots using .drop()
## https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html

## Again we include .copy() to make a deep copy of the dataframe
jr_2 = jr_shots.drop(jr_1.index).copy()

In [None]:
jr_1

In [None]:
jr_2

In [None]:
## Let's check that it worked like we thought it would
len(jr_1) + len(jr_2) == len(jr_shots)
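
Checking the lengths is a good first test. As a further check, one could also confirm that no index label appears in both pieces. The sketch below does this on a small made-up `DataFrame` named `data`. Note that dropping by index label assumes the index labels are unique, which holds for the default integer index.

```python
import pandas as pd

# Hypothetical stand-in for a data set with 10 rows
data = pd.DataFrame({"value": range(10)})

part_1 = data.sample(3).copy()
part_2 = data.drop(part_1.index).copy()

# No index label should appear in both pieces
print(len(part_1.index.intersection(part_2.index)))

# Together the pieces should cover every original label
print(set(part_1.index) | set(part_2.index) == set(data.index))
```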

In [None]:
## You code
## split beer_df into beer_1 and beer_2
## beer_1 should be a random sample of 150 observations
## beer_2 should be all rows not in beer_1



### Saving a `DataFrame` to File

Just as we easily read in a data file, we can use `pandas` to quickly save data to a file.

This is done with the function `df.to_csv()`, <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html">https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html</a>.

In [None]:
## Let's first make a data frame
df = pd.DataFrame({'a':[1,2,3],'b':[2,4,6],'c':[17,34,51]})

df

In [None]:
## Now we call df.to_csv(file_name)
df.to_csv("our_first_dataframe.csv")

Go ahead and check your repository. You should now see `our_first_dataframe.csv` in there.

In [None]:
## You code
## read in "our_first_dataframe.csv" using read_csv
## then look at the df




## is anything off?

By default, `to_csv()` writes the index to the file as well. That is why your `DataFrame` has an extra column, labeled `Unnamed: 0`, when you read it back in using `read_csv`. To avoid writing the index to file, include the argument `index=False`.
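
The sketch below illustrates both behaviors without touching any files. It relies on the fact that calling `to_csv()` with no file name returns the CSV text as a string, and it reads that text back with `io.StringIO`. It also shows one alternative, the `index_col=0` argument of `read_csv`, which treats the first column as the index. The variable names are made up for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'a':[1,2,3],'b':[2,4,6],'c':[17,34,51]})

# Calling to_csv() without a file name returns the CSV text as a string
csv_with_index = df_demo.to_csv()
csv_without_index = df_demo.to_csv(index=False)

# The saved index comes back as an extra first column
with_index = pd.read_csv(io.StringIO(csv_with_index))
print(with_index.columns.tolist())

# With index=False only the original columns are written
without_index = pd.read_csv(io.StringIO(csv_without_index))
print(without_index.columns.tolist())

# Alternatively, index_col=0 uses the first column as the index
restored = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(restored)
```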

In [None]:
## Make the DataFrame again
df = pd.DataFrame({'a':[1,2,3],'b':[2,4,6],'c':[17,34,51]})

df

In [None]:
## write it to file without the index
df.to_csv("our_first_dataframe.csv", index=False)

In [None]:
## You code
## read in "our_first_dataframe.csv" using read_csv
## then look at the df again




That's all you'll need to know from `pandas` to have a firm footing for our boot camp. We will learn more stuff as we go along, but when that time comes you should be ready!

--------------------------

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2022.

Any potential redistributors must seek and receive permission from Matthew Tyler Osborne, Ph.D. prior to redistribution. Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).