# Python Part II: Intro to Pandas

In Python Part I, we flew through introductory Python concepts, everything from the `print()` function to `for` loops, and used them to write small applications such as a coin-flip simulation. At the end, we covered packages, which are importable Python objects with defined methods and attributes. We looked at the `math` package, which includes mathematical functions such as `log10()` and `gcd()`. We also looked at the `random` package, which allowed us to generate random values.

We will now introduce Pandas, a Python package that lets users easily manipulate datasets.

*Note: In the walkthrough below, the first instance of each new Pandas method or attribute will be hyperlinked directly to the Pandas documentation for that method/attribute. Feel free to reference that documentation for additional information.*

## Part 1: The DataFrame

First, to start using Pandas, we have to import it. Typically, Pandas is imported as follows:

In [32]:
import pandas as pd

The above code imports Pandas and then assigns the package a nickname, in this case `pd`. Therefore, whenever we refer to Pandas we can just type `pd`. 
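The `import ... as ...` form is not specific to Pandas. As a small sketch, here is the same aliasing applied to the `math` package from Part I (the nickname `m` is an arbitrary choice for illustration, not a convention):

```python
import math
import math as m  # the same module, bound to a shorter name

# Both names refer to the same module object.
print(m is math)

# So functions can be called through either name.
print(m.log10(1000))
```

By contrast, `pd` is the nickname used almost universally for Pandas, which is why we stick with it here.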

In the code below, we use Pandas to read in our first dataset, the `cookie_sales.csv` file stored in the "Intro to Python - II" folder of the repository.

In [33]:
sales_csv = r"https://raw.githubusercontent.com/cra-international/Intro-to-Python/master/Intro%20to%20Python%20-%20II/cookie_sales.csv"
sales_df = pd.read_csv(sales_csv)

In the above code, we reference the Pandas package using the `pd` notation discussed above. We then call the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, which has one *required* argument: the path to the file. As the documentation notes, "Any valid string path is acceptable. The string could be a URL." So in this case, we pass the function a URL that points to a CSV file.
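As an aside, `read_csv()` also accepts file-like objects, which is handy for trying it out without a network connection. The sketch below is a minimal illustration: the two-row dataset and the names `csv_text` and `demo_df` are made up for this example and are not part of the cookie data files.

```python
import io

import pandas as pd

# Hypothetical two-row dataset, written inline for illustration only.
csv_text = "Troop Number,Box,Sold\n179,Thin Mints,43\n179,Lemonades,34\n"

# Wrap the text in StringIO so read_csv() can treat it like an open file.
demo_df = pd.read_csv(io.StringIO(csv_text))
print(demo_df)
```

The first line of the text becomes the column names and each later line becomes one observation, just as with a CSV file on disk or at a URL.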

Great! So, what did `read_csv()` return when we loaded the sales data into `sales_df`? We can use Python's built-in `type()` function to find out.

In [34]:
print(type(sales_df))

<class 'pandas.core.frame.DataFrame'>


In this case, what was returned was a *DataFrame*, which is a data type we haven't seen before. That's because DataFrames are defined by the Pandas package. The majority of this demo will focus on learning about DataFrames, as this is where all the magic happens. DataFrames not only store data, but they also have *methods* that allow us to manipulate the data easily. We'll cover the methods in more detail, and exactly how DataFrames store data, below.

### Now you try! 

Create a new dataframe, called `price_df`, that takes the value of the data stored in the `cookie_prices.csv` file. The URL to that file has already been stored in the `price_url` variable below. 

In [42]:
price_url = r"https://raw.githubusercontent.com/cra-international/Intro-to-Python/master/Intro%20to%20Python%20-%20II/cookie_prices.csv"
price_df = pd.read_csv(price_url)

## Part 2: Exploring Data

Now that we can read in the data, the next step is to explore the data and see what it looks like. So let's learn our first Pandas method. We can get a quick glimpse of the data by using the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method, which by default returns the first five observations of the dataset.

In [36]:
sales_df.head()

Unnamed: 0,Troop Number,Box,Sold,Notes
0,179,Thin Mints,43,
1,179,Caramel deLites,16,
2,179,Peanut Butter Patties,22,
3,179,Girl Scout S'mores,11,
4,179,Lemonades,34,


Here we see that there are 4 columns: Troop Number, Box, Sold, and Notes. The leftmost column in the output (0 through 4) is the DataFrame's row index rather than a data column. From the looks of it, this is Girl Scout cookie sales data by troop and cookie type.
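`.head()` also takes an optional number of observations to return. The sketch below uses a small made-up DataFrame (built with the `pd.DataFrame` constructor, which this walkthrough has not covered) so that it runs on its own; the name `demo_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical seven-row dataset, for illustration only.
demo_df = pd.DataFrame({
    "Box": ["A", "B", "C", "D", "E", "F", "G"],
    "Sold": [1, 2, 3, 4, 5, 6, 7],
})

# With no argument, .head() returns the first five observations.
print(demo_df.head())

# Passing a number returns that many observations instead.
first_three = demo_df.head(3)
print(first_three)

# The .columns attribute lists the column names without showing any data.
print(list(demo_df.columns))
```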

We can also look at the shape of the data using the [`.shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute. We have not covered attributes up to this point, but an *attribute* is a value associated with a Python object. Like methods, attributes are accessed using the dot operator; unlike methods, they are not followed by parentheses. The `.shape` attribute of a Pandas DataFrame returns a *tuple* containing the number of observations and the number of columns in the DataFrame, in that order. We will talk more about tuples below.

In [37]:
print(sales_df.shape)

(40, 4)


Based on this, we can see that the first value of the tuple is `40` and the second is `4`, indicating that there are 40 observations and 4 columns in `sales_df`. Tuples are very similar to lists: they are ordered and can therefore be accessed using positional indexing. For example, we can get just the number of observations of `sales_df` by doing the following:

In [38]:
print("sales_df has", sales_df.shape[0] , "observations.")

sales_df has 40 observations.


The key difference between tuples and lists, however, is that tuples are *not* mutable, meaning you cannot change the values of a tuple once it has been created, whether by positional indexing or otherwise.
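To make the difference concrete, here is a short sketch in plain Python. The tuple is typed in by hand with the same values as `sales_df.shape`, so the example does not need Pandas; the variable names are chosen only for illustration.

```python
# Same values as sales_df.shape, written out by hand.
shape = (40, 4)

# Tuples can be unpacked into separate variables in one step.
n_rows, n_cols = shape
print(n_rows, "observations and", n_cols, "columns")

# A list built from the tuple can be changed in place.
as_list = list(shape)
as_list[0] = 41

# Trying the same assignment on the tuple raises a TypeError.
error_message = None
try:
    shape[0] = 41
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)
```

The tuple still holds its original values afterwards, while the list copy has been updated.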

### Now you try!

In the first code cell below, use the `.head()` method to look at the `price_df` variable you created above. What does the `price_df` dataset look like? What kind of data is it? Then, in the second code cell, print the number of columns of the data to the console by using positional indexing to reference the second value of the tuple returned by `.shape`.

In [40]:
price_df.head()

Unnamed: 0,Cookie,Price
0,Thin Mints,4
1,Caramel deLites,4
2,Peanut Butter Patties,4
3,Girl Scout S'mores,4
4,Lemonades,4


In [26]:
print("price_df has", price_df.shape[1], "columns.")

price_df has 2 columns.


## Part 3: Cleaning Data


The `sales_df` DataFrame has four columns: "Troop Number", "Box", "Sold", and "Notes". Some of these column names, however, are not that informative. For example, we may want to change "Box" to "Cookie Type" and "Sold" to "Boxes Sold". We can do that with the [`.rename()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) method, as shown below, and then print the first five observations to see the result:

In [41]:
sales_df = sales_df.rename(columns={"Box" : "Cookie Type",
                                    "Sold" : "Boxes Sold"})
sales_df.head()

Unnamed: 0,Troop Number,Cookie Type,Boxes Sold,Notes
0,179,Thin Mints,43,
1,179,Caramel deLites,16,
2,179,Peanut Butter Patties,22,
3,179,Girl Scout S'mores,11,
4,179,Lemonades,34,


We can now see that the column "Box" was changed to "Cookie Type" and "Sold" was changed to "Boxes Sold". 

The `.rename()` method can take multiple arguments, but the one most frequently used is the `columns` argument. We haven't seen this type of argument notation before: an argument can be specified by name using the syntax `<argument name>=<value>`. Arguments specified in this way are called *keyword arguments*. Many of Pandas' more complex methods rely on keyword arguments, so we will see this notation pop up more as we go along.
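Keyword arguments are a general Python feature rather than something specific to Pandas. The sketch below defines a small helper of our own, `describe_sale()`, which is hypothetical and exists only to show the notation:

```python
def describe_sale(cookie, boxes=1, troop=None):
    """Build a short sentence describing a sale (hypothetical helper)."""
    text = f"{boxes} box(es) of {cookie}"
    if troop is not None:
        text += f" sold by troop {troop}"
    return text


# Positional arguments are matched to parameters by their order.
positional = describe_sale("Thin Mints", 43, 179)

# Keyword arguments are matched by name, so their order does not matter.
keyword = describe_sale("Thin Mints", troop=179, boxes=43)

print(positional)
print(positional == keyword)

# Parameters with default values can simply be left out.
print(describe_sale("Lemonades"))
```

Naming the argument also makes a call easier to read, which helps when a method accepts many optional arguments.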

In this case, we specify the keyword argument `columns` within the call to the `.rename()` method and pass a *dictionary* as the value of the argument. The dictionary is structured so that the *key* is the existing column name and the *value* is the new column name. This dictionary contains two key-value pairs, one for each of the two column name changes we intend to make. The result of the `.rename()` method is a new DataFrame with the updated column names. In this case, we update the `sales_df` variable by assigning that new DataFrame to the existing variable `sales_df`.
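Because `.rename()` returns a new DataFrame, the original is left untouched unless we reassign the variable. The sketch below shows this on a one-row DataFrame invented for the purpose; `demo_df` and `renamed_df` are hypothetical names, and the `pd.DataFrame` constructor is used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical one-row dataset, for illustration only.
demo_df = pd.DataFrame({"Box": ["Thin Mints"], "Sold": [43]})

# Store the result under a new name instead of overwriting demo_df.
renamed_df = demo_df.rename(columns={"Box": "Cookie Type"})

print(list(demo_df.columns))     # the original column names are unchanged
print(list(renamed_df.columns))  # the new DataFrame has the updated name
```

Columns that do not appear in the dictionary, such as "Sold" here, keep their existing names.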

### Now you try!

The `price_df` DataFrame has a column called "Cookie" that corresponds to the cookie type column in the sales data. Rename that column to "Cookie Type".