# Python Part II: Intro to Pandas

In Python Part I, we flew through introductory Python concepts, everything from the `print()` function to For Loops, and used that information to write cool little applications like simulating a coin flip. At the end, we covered packages, which are importable Python objects with defined methods and attributes. We looked at the `math` package, which includes relevant mathematical functions like `log10()` and `gcd()`. We also looked at the `random` package, which allowed us to create random variables.

We will now introduce Pandas, a Python package that allows users to easily manipulate datasets.

*Note: In the walkthrough below, the first instance of each new Pandas method or attribute will be hyperlinked directly to the Pandas documentation for that method/attribute. Feel free to reference that documentation for additional information.*

## Part 1: The DataFrame

First, to start using Pandas, we have to import it. Typically, Pandas is imported as follows

In [None]:
import pandas as pd

The above code imports Pandas and then assigns the package a nickname, in this case `pd`. Therefore, whenever we refer to Pandas we can just type `pd`. 

In the code below, we use Pandas to read in our first dataset, specifically the `cookie_sales.csv` file stored in the "Intro to Python - II" repository.

In [None]:
sales_csv = r"https://raw.githubusercontent.com/cra-international/Intro-to-Python/master/Intro%20to%20Python%20-%20II/cookie_sales.csv"
sales_df = pd.read_csv(sales_csv)

In the above code, we reference the Pandas package using the `pd` notation discussed above. We then call the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, which has one *required* argument, in this case the path to the file. As the documentation notes, "Any valid string path is acceptable. The string could be a URL." So in this case, we pass a URL to the function which itself points to a CSV file.

Great! So, what was returned by this `read_csv()` function? Let's find out! We can use Python's built-in `type()` function to find out.

In [None]:
print(type(sales_df))

In this case, what was returned was a *DataFrame*, which is a data type we haven't seen before. That's because DataFrames are defined by the Pandas package. The majority of this demo will focus on learning about DataFrames, as this is where all the magic happens. DataFrames not only store data, but they also have *methods* that allow us to manipulate the data easily. We'll cover the methods in more detail, and exactly how DataFrames store data, below.
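
The same steps can be tried without a network connection, because `read_csv()` also accepts file-like objects. The sketch below is only an illustration: the CSV text, the name `toy_df`, and the values are made up and are not the contents of `cookie_sales.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file; not the actual sales data.
csv_text = "Troop Number,Box,Sold\n1,Thin Mints,10\n1,Shortbread,5\n"

# read_csv() accepts a path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(type(toy_df))
print(toy_df)
```

Whatever the source of the text, the object that comes back is a DataFrame.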

### Now you try! 

Create a new dataframe, called `price_df`, that takes the value of the data stored in the `cookie_prices.csv` file. The URL to that file has already been stored in the `price_url` variable below. 

In [None]:
price_url = r"https://raw.githubusercontent.com/cra-international/Intro-to-Python/master/Intro%20to%20Python%20-%20II/cookie_prices.csv"
price_df = pd.read_csv(price_url)

## Part 2: Exploring Data

### <ins>The `.head()` Method and the `.shape` Attribute</ins>

Now that we can read in the data, the next step is to explore the data and see what it looks like. So let's learn our first Pandas method. We can get a quick glimpse of the data by using the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method, which by default returns the first five observations of the dataset.

In [None]:
sales_df.head()

Here we see that there are 4 columns: Troop Number, Box, Sold, and Notes. From the looks of it, this is Girl Scout cookie sales data by troop and cookie type.
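
`.head()` also accepts an optional number of rows. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical) to compare the default with an explicit row count.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate .head().
toy_df = pd.DataFrame({"Box": ["A", "B", "C", "D", "E", "F", "G"],
                       "Sold": [1, 2, 3, 4, 5, 6, 7]})

first_five = toy_df.head()    # default: the first five rows
first_two = toy_df.head(2)    # an explicit row count

print(len(first_five), len(first_two))
```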

We can also look at the shape of the data using the [`.shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute. We have not covered attributes up to this point, but an *attribute* is a static value associated with a Python object. Like methods, attributes are accessed using the dot operator; however, they are not followed by any parentheses. In this case, the `.shape` attribute of a Pandas DataFrame returns a *tuple* of the number of observations and number of columns in the DataFrame, in that order. We will talk more about tuples below.

In [None]:
print(sales_df.shape)

Based on this, we can see that the first value of the tuple is `51` and the second is `4`, indicating that there are 51 observations and 4 columns in `sales_df`. Tuples are very similar to lists, as they are ordered and therefore can be accessed using positional indexing. For example, we can get just the number of observations of `sales_df` by doing the following

In [None]:
print("sales_df has", sales_df.shape[0], "observations.")

The key difference between tuples and lists, however, is that tuples are *not* mutable, meaning you cannot change the values of a tuple using positional indexing.
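
A short sketch of that difference, using a made-up 3-by-2 DataFrame (the names `toy_df`, `as_list`, and `tuple_changed` are illustrative only): the list copy can be edited in place, while the same assignment on the tuple raises a `TypeError`.

```python
import pandas as pd

# Hypothetical toy DataFrame with 3 rows and 2 columns.
toy_df = pd.DataFrame({"Box": ["A", "B", "C"], "Sold": [1, 2, 3]})

shape = toy_df.shape
print(shape[0], "observations and", shape[1], "columns")

# A list can be changed in place...
as_list = list(shape)
as_list[0] = 100

# ...but assigning into a tuple raises a TypeError.
try:
    shape[0] = 100
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("TypeError:", err)
```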

### Now you try!

In the first code cell below, use the `.head()` method to look at the `price_df` variable you created above. What does the `price_df` dataset look like? What kind of data is it? Then, in the second code cell, print the number of columns of the data to the console by using positional indexing to reference the second value of the tuple returned by `.shape`.

In [None]:
price_df.head()

In [None]:
print("price_df has", price_df.shape[1], "columns.")

### <ins>Pandas Data Types</ins>

DataFrames, like other 2-dimensional data storage types such as SQL tables or Stata datasets, define data types at the *column* level. Therefore, like regular Python, DataFrames also have basic data types we need to learn about. These are
* int64
* float64
* object
* bool
* datetime64

You should notice some similarities between these data types and the basic Python data types. `int64` is Pandas' integer data type, where the 64 denotes how many bits each value is comprised of. Since there are 8 bits in a byte, each value occupies 8 bytes. One of those bits stores the sign, so the largest value an `int64` can hold is $2^{63} - 1$, which is typically plenty large enough. `float64` is similar, but for floats, and `bool` is a column of boolean values.
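
The limits of a 64-bit signed integer can be checked directly. This sketch assumes NumPy is available (it is installed as a dependency of Pandas) and uses `np.iinfo`, which reports the range of an integer type.

```python
import numpy as np

# np.iinfo reports the limits of an integer type.
limits = np.iinfo(np.int64)

print(limits.bits)   # number of bits per value
print(limits.min)    # smallest representable value
print(limits.max)    # largest representable value

# One bit is used for the sign, so the maximum is 2**63 - 1.
largest_matches = limits.max == 2**63 - 1
print(largest_matches)
```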

How are strings stored? Strings are stored using the `object` data type. It's important to note that while all string columns are stored as `object`s, not all `object` columns are strings. For example, we may have a column of *mixed* values, say strings and numbers. That type of column will be cast as an `object` type, and when such a column is read in from a CSV file, the numbers in it are essentially stored as text.

Finally, there is one data type we haven't encountered before, and that's the `datetime64` type. `datetime` objects, which come from Python's standard-library `datetime` module, allow for easy creation, manipulation, and storage of dates. Pandas' `datetime64` data type is a vectorized form of the `datetime` object type that allocates 8 bytes to each value in the column.

So, how do we see what data types specific columns are in our data? We can do that by referencing the [`.dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) attribute of a DataFrame.

In [None]:
sales_df.dtypes

This DataFrame attribute simply reports the data type of each column in the associated DataFrame.
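
To see several of these data types side by side, here is a minimal sketch on a made-up DataFrame. The column names and values are hypothetical, and the exact label printed for the date column (for example its time unit) can vary between Pandas versions.

```python
import pandas as pd

# Hypothetical toy data with one column per kind of value.
toy_df = pd.DataFrame({
    "count": [1, 2, 3],
    "price": [4.0, 4.5, 5.0],
    "in_stock": [True, False, True],
    "mixed": [10, "No Sales", 12],   # numbers and text together
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})

print(toy_df.dtypes)
```

The `mixed` column is reported as `object` because it holds both numbers and text.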

Based on the first 5 observations of the data we saw from calling the `.head()` method, it's a bit odd that "Sold", which looks numeric, is of the `object` type. What's going on? Let's print the whole dataset to see. To do that in a Jupyter Notebook, we simply run a block of code with just the variable name of the DataFrame we want to see, as below

In [None]:
sales_df

As we can see, while "Sold" is almost always numeric, there are two instances where it takes the value "No Sales". (That seems to be because, based on the "Notes" column, a "Severe Peanut Allergy Impacted Sales".) Those two string values have caused the whole column to be cast as an `object` type.
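
The effect is easy to reproduce on a tiny made-up example. In the sketch below, the CSV text is hypothetical and only the "No Sales" entry mirrors the real data. A single text entry is enough to keep the column from being numeric, and even the value `43` comes back as a string.

```python
import io

import pandas as pd

# Hypothetical CSV text; only the "No Sales" entry mirrors the real data.
csv_text = "Box,Sold\nThin Mints,43\nShortbread,No Sales\nLemonades,34\n"
toy_df = pd.read_csv(io.StringIO(csv_text))

# The column is not numeric, and even "43" is stored as text.
is_numeric = pd.api.types.is_numeric_dtype(toy_df["Sold"])
first_value = toy_df["Sold"].iloc[0]

print(is_numeric)
print(repr(first_value), type(first_value))
```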

There also seem to be some other problems with the data. For example, it looks like there are some duplicates of Troop 179's data at the bottom of the dataset. We will address how to resolve this and the data type problem later on in the walkthrough.


### Now You Try!

Print the data types of the `price_df` dataset below and then print the entire dataset to the console.

In [None]:
print(price_df.dtypes)
price_df

## Part 3: Renaming Columns


The `sales_df` DataFrame has four columns: "Troop Number", "Box", "Sold", and "Notes". Some of these column names, however, are not that informative. For example, we may want to change "Box" to "Cookie Type" and "Sold" to "Boxes Sold". Let's do that using the [`.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method, as below, and then print the first five observations to see the results

In [None]:
sales_df = sales_df.rename(columns={"Box" : "Cookie Type",
                                    "Sold" : "Boxes Sold"})
sales_df.head()

We can now see that the column "Box" was changed to "Cookie Type" and "Sold" was changed to "Boxes Sold". 

The `.rename()` method can take multiple arguments, but the one most frequently used is the `columns` argument. We haven't seen this type of argument notation before: specific arguments can be specified using the syntax `<argument name>=<value>`. Arguments specified in this way have a specific name, *keyword arguments*. Many of Pandas' more complex methods rely on keyword arguments, so we will see this notation pop up more as we go along.

In this case, we specify the keyword argument `columns` within the call to the `.rename()` method and pass a *dictionary* as the value of the argument. The dictionary is structured so that the *key* is the existing column name and the *value* is the new column name. This dictionary contains multiple key-value pairs corresponding to two column name changes we intend to make. The result of the `.rename()` method is a new DataFrame with the updated column names. In this case, we update the `sales_df` variable by assigning that new DataFrame to the existing variable `sales_df`. 
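
The fact that `.rename()` returns a new DataFrame, rather than changing the existing one, can be seen in a small sketch. The data and the names `toy_df` and `renamed` are hypothetical; the column names match the ones used above.

```python
import pandas as pd

# Hypothetical toy data using the original column names from the walkthrough.
toy_df = pd.DataFrame({"Box": ["Thin Mints"], "Sold": [43]})

# columns= is a keyword argument; its value is an {old name: new name} dictionary.
renamed = toy_df.rename(columns={"Box": "Cookie Type", "Sold": "Boxes Sold"})

print(list(toy_df.columns))    # the original DataFrame is unchanged
print(list(renamed.columns))   # the returned DataFrame has the new names
```

This is why the walkthrough assigns the result back to `sales_df`: without the assignment, the renamed DataFrame would be discarded.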

### Now You Try!

The price variable in the `price_df` data is currently just called "Dollars". Rename that variable to "Price ($)" to make it more specific/informative. 

*Note: Pandas allows the use of special characters in column names!*

In [None]:
price_df = price_df.rename(columns={"Dollars" : "Price ($)"})
price_df.head()

## Part 4: Indexing Data

How do you select just the data you want to work with? That can be done in Pandas using *indexing*, similar to how we used indexing to select specific values from lists in Intro to Python Part I. We will talk through all of the basics of indexing below as this is a crucial aspect of working with data in Pandas.

### <ins>The Index</ins>

In each print out of the datasets above, there is an additional "column" on the left-hand side of the data that is yet to be explained. This is called the DataFrame *index* and is a unique ID associated with each observation in the dataset. The DataFrame is very similar to a list in the sense that its (default) index is zero-indexed and can be used to access specific data within the Python object. 

Unlike lists, however, observations are accessed with a specific DataFrame attribute called [`.iloc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html), which stands for "integer location". Let's see how `.iloc` works below by printing `sales_df.iloc`.

In [None]:
print(sales_df.iloc)

We see above that `sales_df.iloc` is an `_iLocIndexer` object, an object defined by the Pandas package. (That gobbledygook on the far right is the location of the object in memory, in hexadecimal.) You don't really need to worry much about this, but the most important part is that the `_iLocIndexer` object can be indexed in the exact same way as a list. For example, we can access the first 5 observations of a DataFrame as follows

In [None]:
sales_df.iloc[0:5]

Perfect! This gives us information identical to `sales_df.head()`. You can also get multiple specific non-sequential rows by passing a *list* of the desired rows within the square brackets.

In [None]:
sales_df.iloc[[1, 3, 5]]

What happens if you access just one row? Let's see!

In [None]:
print(sales_df.iloc[0])
print()
print(type(sales_df.iloc[0]))

We see that if we print just one observation, we get a print out of the individual values of the row as well as the "Name" or index of the row. If we check the type of the row using the `type()` function, we see that we no longer have a DataFrame but now have a Pandas [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). 

Pandas stores individual rows or individual columns as Series, which you can think of as complex lists that have very helpful mathematical properties and defined methods. For example, the sum of two Series is not one longer Series, as it would be for lists, but rather the element-wise sum of two vectors as defined by vector algebra.
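
Here is a minimal sketch of both points, using made-up values (all names are illustrative): a single row pulled out with `.iloc` is a Series, and adding two Series adds matching elements, whereas adding two lists joins them end to end.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# A single row selected with .iloc comes back as a Series.
row = toy_df.iloc[0]
print(type(row))

# Adding lists concatenates them...
list_sum = [1, 2, 3] + [10, 20, 30]

# ...while adding Series adds the matching elements.
series_sum = pd.Series([1, 2, 3]) + pd.Series([10, 20, 30])

print(list_sum)
print(series_sum.tolist())
```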

*Note: You can print an empty line in Python by just calling* `print()`

### Now You Try!

Combine what you just learned with your knowledge of list-based indexing to print the last 20 observations of the `sales_df` dataset.

In [None]:
sales_df.iloc[-20:]

### <ins>Column and Combined Indexing</ins>

It's important to note that in a DataFrame, observations are not the only things with indices. Columns have indices as well. It just happens that when we read in our data, Pandas didn't use its default indices (0, 1, 2, 3, etc.), but gave the columns *named indices*, in this case the column names stored in the first row of our CSV data. Therefore, we can use the column names to directly access specific columns by name using dictionary-like key indexing, as below

In [None]:
print(sales_df["Boxes Sold"])
print()
print(type(sales_df["Boxes Sold"]))

Like with the individual row we accessed using the `iloc` object, here accessing just one column returns a Series object. Also, like with `iloc` indexing, we can access multiple columns by passing a list of column names as below

In [None]:
sales_df[["Cookie Type", "Boxes Sold"]]
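
The difference between passing a single name and passing a list of names is worth seeing once on a small made-up DataFrame. The values below are hypothetical; the column names follow the walkthrough.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"Cookie Type": ["Thin Mints", "Shortbread"],
                       "Boxes Sold": [43, 38],
                       "Notes": ["", ""]})

one_column = toy_df["Boxes Sold"]                     # a single name -> Series
as_frame = toy_df[["Boxes Sold"]]                     # a one-item list -> DataFrame
two_columns = toy_df[["Cookie Type", "Boxes Sold"]]   # a list -> DataFrame

print(type(one_column))
print(type(as_frame))
print(two_columns.shape)
```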

What if you want to index a dataset based on both columns and observations? Say we want to see only the first 10 observations of the "Cookie Type" and "Boxes Sold" columns. In that case, we'd use the [`.loc`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) indexer, which indexes a DataFrame based on the names (labels) of its indices.

In [109]:
sales_df.loc[:9, ["Cookie Type", "Boxes Sold"]]

|   | Cookie Type | Boxes Sold |
|---|---|---|
| 0 | Thin Mints | 43 |
| 1 | Caramel deLites | 16 |
| 2 | Peanut Butter Patties | 22 |
| 3 | Girl Scout S'mores | 11 |
| 4 | Lemonades | 34 |
| 5 | Peanut Butter Sandwich | 8 |
| 6 | Shortbread | 38 |
| 7 | Thanks-A-Lot | 26 |
| 8 | Caramel Chocolate Chip | 17 |
| 9 | Toffee-tastic | 29 |


In the above case, we take advantage of the fact that the *name* of each row/observation index is the same as its position, which is the case by default. We therefore specify our doubly-indexed DataFrame by giving the observations we want first (in this case as a *slice*, or range, of observations), followed by a comma, followed by a *list* of our desired columns. Note that, unlike list and `.iloc` slices, a `.loc` slice includes its endpoint, so `:9` selects the labels 0 through 9, which are the first 10 observations.
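
One detail to keep in mind is that `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position, just like a list. A small sketch on made-up data (the names `by_position` and `by_label` are illustrative):

```python
import pandas as pd

# Hypothetical toy data with the default 0, 1, 2, ... row index.
toy_df = pd.DataFrame({"Cookie Type": list("ABCDEF"),
                       "Boxes Sold": [1, 2, 3, 4, 5, 6]})

by_position = toy_df.iloc[:3]               # positions 0, 1, 2
by_label = toy_df.loc[:3, ["Boxes Sold"]]   # labels 0, 1, 2, 3

print(len(by_position))
print(len(by_label))
```

The same slice bound therefore returns one more row with `.loc` than with `.iloc` when the index is the default one.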

### Now You Try!

Use the `.loc` indexer to access the 1st, 21st, and 27th observations and just the "Troop Number" and "Notes" columns.

In [115]:
sales_df.loc[[0, 20, 26], ["Troop Number", "Notes"]]

|   | Troop Number | Notes |
|---|---|---|
| 0 | 179 | |
| 20 | 177 | \*A kind soul just dropped off a $20 |
| 26 | 143 | \*Severe Peanut Allergy Impacted Sales |


### <ins>Boolean Indexing</ins>

While good to know, it's sometimes not useful/pragmatic to access individual rows of a DataFrame using positional indexing. More often than not, it's more useful to be able to access rows based on a characteristic they share, such as a specific value or condition. Indexing a DataFrame in that way is possible using boolean indexing.

For example, you might want to look at all of the data for Troop 177 only. Rather than identifying the position of each observation associated with Troop 177 and then selecting those observations using positional indexing, we can index directly by doing the following

In [119]:
sales_df[sales_df["Troop Number"] == 177]

|   | Troop Number | Cookie Type | Boxes Sold | Notes |
|---|---|---|---|---|
| 10 | 177 | Thin Mints | 117 | |
| 11 | 177 | Caramel deLites | 84 | |
| 12 | 177 | Peanut Butter Patties | 39 | |
| 13 | 177 | Girl Scout S'mores | 33 | |
| 14 | 177 | Lemonades | 56 | |
| 15 | 177 | Peanut Butter Sandwich | 12 | |
| 16 | 177 | Shortbread | 86 | |
| 17 | 177 | Thanks-A-Lot | 24 | |
| 18 | 177 | Caramel Chocolate Chip | 45 | |
| 19 | 177 | Toffee-tastic | 77 | |


In the code above, rather than specify specific indices associated with the observations we want to isolate, we instead pass a comparison, in this case a comparison of a specific column against a specific value. In Python Part I, we saw that the result of a comparison like `5 < 6` is a boolean value. What does this comparison of a Pandas Series against a value return? 

In [120]:
sales_df["Troop Number"] == 177

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
20     True
21    False
22    False
23    False
24    False
25    False
26    False
27    False
28    False
29    False
30    False
31    False
32    False
33    False
34    False
35    False
36    False
37    False
38    False
39    False
40    False
41    False
42    False
43    False
44    False
45    False
46    False
47    False
48    False
49    False
50    False
Name: Troop Number, dtype: bool

A comparison involving an individual column returns another Series, in this case a Series of type `bool`. Specifically, this Series takes the value `True` whenever the comparison is true and `False` otherwise. Therefore, this Series of boolean values can be used to *boolean index* the dataset, as a boolean value of `True` indicates that we'd like to select the observation.
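
The two steps can be separated to make the mechanics visible. The sketch below uses made-up data (`toy_df`, `mask`, and the values are hypothetical): it builds the boolean Series first and then uses it to select rows. Summing the mask, which counts the `True` values, is an extra shown only for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy_df = pd.DataFrame({"Troop Number": [179, 177, 177, 143],
                       "Boxes Sold": [43, 117, 84, 20]})

# Step 1: the comparison produces a boolean Series, one value per row.
mask = toy_df["Troop Number"] == 177
print(mask.tolist())

# Step 2: indexing with that Series keeps only the rows marked True.
selected = toy_df[mask]
print(selected)

# True counts as 1 when summed, so this is the number of selected rows.
n_true = int(mask.sum())
print(n_true)
```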

### Now You Try!

Use boolean indexing to select just the observations from `sales_df` associated with "Toffee-tastic" cookies.

In [121]:
sales_df.loc[sales_df["Cookie Type"] == "Toffee-tastic"]

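|   | Troop Number | Cookie Type | Boxes Sold | Notes |
|---|---|---|---|---|
| 9 | 179 | Toffee-tastic | 29 | |
| 19 | 177 | Toffee-tastic | 77 | |
| 30 | 143 | Toffee-tastic | 103 | |
| 40 | 212 | Toffee-tastic | 13 | |
| 50 | 179 | Toffee-tastic | 29 | |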
50,179,Toffee-tastic,29,


## Part 5: Cleaning Data

Up to this point, we have read in data, explored it, renamed columns (or "edited metadata" if you want to be fancy), and learned how to select just the data we want to work with. Now we will cover how to clean data so that we can finally analyze it.

Cleaning data is a broad umbrella, but can include changing variable types, dropping superfluous columns and unneeded observations, removing duplicate data, and much more. We will look into how to do many of these tasks below.

### <ins>Replace Observations</ins>

When we looked at the sales data earlier, we noticed that the "Boxes Sold" column was an `object` type because it had a couple of instances where it took the value "No Sales", thereby preventing it from being read in as a numeric variable. We'd like to calculate some statistics based on the number of "Boxes Sold", so we'll need to convert "Boxes Sold" to a numeric type. To do that, we'll first need to replace these observations with a numeric value.
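
As a rough sketch of what such a replacement could look like, the example below works on made-up data and is not necessarily the approach the walkthrough itself takes. It assumes that "No Sales" should be treated as zero boxes, uses a boolean mask with `.loc` to overwrite those entries, and then converts the column with `.astype(int)`.

```python
import pandas as pd

# Hypothetical toy data: numbers stored as text alongside a "No Sales" entry.
toy_df = pd.DataFrame({"Boxes Sold": ["43", "No Sales", "16"]})

# Find the offending rows with a boolean mask.
no_sales = toy_df["Boxes Sold"] == "No Sales"

# Assumption: "No Sales" means zero boxes were sold.
toy_df.loc[no_sales, "Boxes Sold"] = "0"

# With only digit strings left, the column can be converted to integers.
toy_df["Boxes Sold"] = toy_df["Boxes Sold"].astype(int)

print(toy_df["Boxes Sold"].tolist())
print(toy_df["Boxes Sold"].sum())
```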

Let's use what we learned about i