# Python Part II: Intro to Pandas

In Python Part I, we flew through introductory Python concepts, everything from the `print()` function to for loops, and used that information to write cool little applications like simulating a coin flip. At the end, we covered packages, which are importable Python objects with defined methods and attributes. We looked at the `math` package, which includes relevant mathematical functions like `log10()` and `gcd()`. We also looked at the `random` package, which allowed us to create random variables.

We will now introduce Pandas, a Python package that lets users easily manipulate datasets in Python.

*Note: In the walkthrough below, the first instance of each new Pandas method or attribute will be hyperlinked directly to the Pandas documentation for that method/attribute. Feel free to reference that documentation for additional information.*

## Part 1: The DataFrame

First, to start using Pandas, we have to import it. Typically, Pandas is imported as follows:

In [7]:
import pandas as pd

The above code imports Pandas and then assigns the package a nickname, in this case `pd`. Therefore, whenever we refer to Pandas we can just type `pd`. 

In the code below, we use Pandas to read in our first dataset, specifically the `cookie_sales.csv` file stored in the "Intro to Python - II" repository.

In [8]:
sales_csv = r"https://raw.githubusercontent.com/cra-international/Intro-to-Python/master/Intro%20to%20Python%20-%20II/cookie_sales.csv"
sales_df = pd.read_csv(sales_csv)

In the above code, we reference the Pandas package using the `pd` notation discussed above. We then call the [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, which has one *required* argument, in this case the path to the file. As the documentation notes, "Any valid string path is acceptable. The string could be a URL." So in this case, we pass the function a URL that points to a CSV file.
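To see that `read_csv()` is not tied to one particular kind of source, here is a minimal sketch that reads CSV text held in memory instead of a URL. The CSV text and the `toy_df` name are made up for illustration and are not part of the course data.

```python
import io

import pandas as pd

# Made-up CSV text for illustration only; this is not the course dataset.
csv_text = """Troop Number,Cookie,Boxes Sold
101,Thin Mints,12
101,Lemonades,7
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df)
```

The first line of the text is treated as the header row, just as it is when reading a CSV file from disk or from a URL.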

Great! So, what was returned by this `read_csv()` function? We can use Python's built-in `type()` function to find out.

In [9]:
print(type(sales_df))

<class 'pandas.core.frame.DataFrame'>


In this case, what was returned was a *dataframe*, which is a data type we haven't seen before. That's because dataframes are defined by the Pandas package. We will spend the majority of our time learning about dataframes, as this is where all the magic happens. Dataframes not only store data, but they also have *methods* that allow us to manipulate the data easily. 
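As a rough sketch of what "data plus methods" means, the example below builds a tiny dataframe by hand and calls one method on it. The data and variable names are hypothetical, and `.sort_values()` is just one of many dataframe methods, chosen here only as an illustration.

```python
import pandas as pd

# Hypothetical toy data, not taken from cookie_sales.csv.
toy_df = pd.DataFrame(
    {"Cookie": ["Thin Mints", "Lemonades"], "Boxes Sold": [12, 7]}
)

# The object is a Pandas DataFrame, the same type that read_csv() returns.
print(isinstance(toy_df, pd.DataFrame))

# Methods are called with the dot operator; this one returns a new,
# sorted dataframe and leaves toy_df itself unchanged.
sorted_df = toy_df.sort_values("Boxes Sold")
print(sorted_df)
```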

### Now you try! 

Create a new dataframe, called `price_df`, that takes the value of the data stored in the `cookie_prices.csv` file. The URL to that file has already been stored in the `price_url` variable below. 

In [5]:
price_url = r"https://raw.githubusercontent.com/cra-international/Intro-to-Python/master/Intro%20to%20Python%20-%20II/cookie_prices.csv"


## Part 2: Exploring Data

Now that we can read in the data, the next step is to explore the data and see what it looks like. So let's learn our first DataFrame method. We can get a quick glimpse of the data by using the [`.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method, which by default returns the first five observations of the dataset.

In [10]:
sales_df.head()

Unnamed: 0,Troop Number,Cookie,Boxes Sold,Notes
0,179,Thin Mints,43.0,
1,179,Caramel deLites,16.0,
2,179,Peanut Butter Patties,22.0,
3,179,Girl Scout S'mores,11.0,
4,179,Lemonades,34.0,


Here we see that there are 4 columns: Troop Number, Cookie, Boxes Sold, and Notes. From the looks of it, this is Girl Scout cookie sales data by troop and cookie type.
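If five rows is not the right amount, `.head()` also accepts the number of rows to return, and `.tail()` does the same from the end of the data. A minimal sketch on a made-up ten-row dataframe (the `row_id` column is hypothetical):

```python
import pandas as pd

# Hypothetical ten-row dataframe used only to show how many rows come back.
toy_df = pd.DataFrame({"row_id": range(10)})

default_head = toy_df.head()   # no argument: the first 5 rows
first_three = toy_df.head(3)   # the first 3 rows
last_two = toy_df.tail(2)      # the last 2 rows

print(len(default_head), len(first_three), len(last_two))
```

Each of these calls returns a new, smaller dataframe. In a notebook, the result is displayed automatically when the call is the last line of a cell.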

We can also look at the shape of the data using the [`.shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute. We have not covered attributes up to this point, but an *attribute* is a static value associated with a Python object. Like methods, attributes are accessed using the dot operator; however, they are not followed by parentheses. In this case, the `.shape` attribute of a Pandas DataFrame returns a *tuple* of the number of observations and number of columns in the DataFrame, in that order. We will talk more about tuples below.

In [13]:
print(sales_df.shape)

(40, 4)


Based on this, we can see that the first value of the tuple is `40` and the second is `4`, indicating that there are 40 observations and 4 columns in `sales_df`. Tuples are very similar to lists in that they are ordered, so their values can be accessed using positional indexing. For example, we can get just the number of observations of `sales_df` by doing the following:

In [15]:
print("sales_df has", sales_df.shape[0], "observations.")

sales_df has 40 observations.


The key difference between tuples and lists, however, is that tuples are *not* mutable, meaning you cannot change the values of a tuple using positional indexing.
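The sketch below illustrates both points on a small made-up dataframe: the two values of `.shape` can be read by position (or unpacked into two names), while trying to assign to a position raises a `TypeError`. The dataframe and variable names are hypothetical.

```python
import pandas as pd

# Hypothetical 3-row, 2-column dataframe for illustration.
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Tuples can be unpacked into separate names, in order.
n_rows, n_cols = toy_df.shape
print(n_rows, n_cols)

# Tuples are immutable: assigning to a position raises a TypeError.
shape = toy_df.shape
try:
    shape[0] = 100
    assignment_failed = False
except TypeError as err:
    assignment_failed = True
    print("TypeError:", err)

# A list built from the same values can be changed freely.
shape_list = list(shape)
shape_list[0] = 100
print(shape_list)
```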

### Now you try!

Use the `.head()` method to look at the `price_df` variable you created above. What does the `price_df` dataset look like? What kind of data is it? Then, print the number of columns of the data to the console by using positional indexing to reference the second value of the tuple returned by `.shape`.

And now let's look at the dataset again to see the result:

In [42]:
pop_df.head()

Unnamed: 0,County_Name,GEO_ID,Population Estimate,B01003_001M,NAME,B01003_001EA,B01003_001MA,state,county
0,"Autauga County, Alabama",0500000US01001,55200,-555555555,"Autauga County, Alabama",,*****,1,1
1,"Baldwin County, Alabama",0500000US01003,208107,-555555555,"Baldwin County, Alabama",,*****,1,3
2,"Barbour County, Alabama",0500000US01005,25782,-555555555,"Barbour County, Alabama",,*****,1,5
3,"Bibb County, Alabama",0500000US01007,22527,-555555555,"Bibb County, Alabama",,*****,1,7
4,"Blount County, Alabama",0500000US01009,57645,-555555555,"Blount County, Alabama",,*****,1,9


Nice! We successfully renamed our variable of interest from that obscure variable name to something useful. Note that in Pandas, we can include spaces in the column names! That's very different from statistical packages like SAS or Stata.
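As a minimal sketch of the renaming pattern, the example below passes a dictionary of old name to new name to the `columns` argument of `.rename()`. The toy dataframe and the column name `VAR_001` are hypothetical stand-ins, not the actual Census column names.

```python
import pandas as pd

# Hypothetical toy data; "VAR_001" stands in for an obscure column name.
toy_df = pd.DataFrame({"VAR_001": [100, 200], "other": ["x", "y"]})

# rename() takes a dictionary mapping old names to new names.
# It returns a new dataframe, so we assign the result to a name.
renamed_df = toy_df.rename(columns={"VAR_001": "Population Estimate"})

print(list(renamed_df.columns))

# Column names containing spaces are accessed with bracket notation.
print(renamed_df["Population Estimate"].tolist())
```

Because `.rename()` returns a new dataframe by default, `toy_df` still has its original column names unless the result is assigned back to it.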

### Now you try!

In this dataset, the variables "state" and "county" refer to the state FIPS code and county FIPS code. Those are two-digit and three-digit numeric codes, respectively, that uniquely identify states and counties within a state. Rename those two variables to "state_fips" and "county_fips" using the `.rename()` method.

The Census data also includes some columns that we don't really need, such as "NAME", which looks like a duplicate of "County_Name". We can drop columns using the `.drop()` method. In this instance, we pass a *list* of columns we'd like to drop to the `columns` argument of the `.drop()` method. See below for an example:

In [43]:
pop_df = pop_df.drop(columns=["NAME"])
pop_df.head()

Unnamed: 0,County_Name,GEO_ID,Population Estimate,B01003_001M,B01003_001EA,B01003_001MA,state,county
0,"Autauga County, Alabama",0500000US01001,55200,-555555555,,*****,1,1
1,"Baldwin County, Alabama",0500000US01003,208107,-555555555,,*****,1,3
2,"Barbour County, Alabama",0500000US01005,25782,-555555555,,*****,1,5
3,"Bibb County, Alabama",0500000US01007,22527,-555555555,,*****,1,7
4,"Blount County, Alabama",0500000US01009,57645,-555555555,,*****,1,9
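One detail worth seeing in isolation is why the result of `.drop()` is assigned back to a variable: by default the method returns a new dataframe and leaves the original untouched. Below is a minimal sketch on a made-up dataframe; the column names `keep_me` and `drop_me` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration.
toy_df = pd.DataFrame({"keep_me": [1, 2], "drop_me": [3, 4]})

# drop() returns a new dataframe; toy_df itself is not modified.
trimmed_df = toy_df.drop(columns=["drop_me"])
print(list(toy_df.columns))
print(list(trimmed_df.columns))

# Dropping a column that is not there raises a KeyError by default.
# Passing errors="ignore" skips missing columns instead.
still_trimmed = trimmed_df.drop(columns=["drop_me"], errors="ignore")
print(list(still_trimmed.columns))
```

This is also why re-running a cell like `pop_df = pop_df.drop(columns=["NAME"])` a second time fails with a `KeyError`: the column is already gone from `pop_df`.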


### Now you try!

Use the `.drop()` method to drop the three variables that begin with "B01003" and then print the first five observations of the new dataset using the `.head()` method.

In [None]:
pop_df = pop_df.drop(columns=["B01003_001M", "B01003_001EA", "B01003_001MA"])

In [47]:
pop_df.head()

Unnamed: 0,County_Name,GEO_ID,Population Estimate,state,county
0,"Autauga County, Alabama",0500000US01001,55200,1,1
1,"Baldwin County, Alabama",0500000US01003,208107,1,3
2,"Barbour County, Alabama",0500000US01005,25782,1,5
3,"Bibb County, Alabama",0500000US01007,22527,1,7
4,"Blount County, Alabama",0500000US01009,57645,1,9
