# Python Part II: Intro to Pandas

In Python Part I, we flew through introductory Python concepts, everything from the `print()` function to `for` loops, and used that information to write cool little applications like simulating a coin flip. At the end, we covered packages, which are importable Python objects with defined methods and attributes. We looked at the `math` package, which includes relevant mathematical functions like `log10()` and `gcd()`. We also looked at the `random` package, which allowed us to generate random values.

We will now introduce Pandas, a Python package that allows users to easily manipulate datasets.

## Part 1: The DataFrame

First, to start using Pandas, we have to import it. Typically, Pandas is imported as follows:

In [8]:
import pandas as pd

The above code imports Pandas and then assigns the package a nickname, in this case `pd`. Therefore, whenever we refer to Pandas we can just type `pd`. 
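The same `import ... as ...` syntax works for any package. As a minimal sketch using the `math` package from Part I (the nickname `m` is only an illustrative choice here, not a common convention):

```python
import math as m  # "m" is an arbitrary nickname chosen for this illustration

# Functions are reached through the nickname instead of the full package name.
result = m.log10(100)
print(result)  # 2.0
```
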

With that, let's read in some data! The New York Times has been publishing their coronavirus case data online in a [GitHub repository](https://github.com/nytimes/covid-19-data). For this exercise, we will pull US county-level data on coronavirus cases using [this link](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv). The data is in CSV (Comma-Separated Values) format.

*Note: Reading in the data may take some time. Please wait until you no longer see an asterisk (`*`) next to the code below before you continue.*

In [12]:
county_url = "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
df = pd.read_csv(county_url)

In the above code, we reference the Pandas package using the `pd` notation discussed above. We then call the [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) function, which has one *required* argument: the path to the file. As the documentation notes, "Any valid string path is acceptable. The string could be a URL." So in this case, we pass the function a URL that points to a CSV file.
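`read_csv()` is not limited to URLs. The sketch below assumes a tiny, made-up CSV held in memory (the rows are invented for illustration and are not taken from the NY Times file) and reads it through `io.StringIO`, a file-like object, so it runs without an internet connection.

```python
import io
import pandas as pd

# A small, made-up CSV used only for illustration.
csv_text = """name,score
Ada,90
Grace,85
Linus,78
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)          # (3, 2): three rows, two columns
print(list(toy_df.columns))  # ['name', 'score'], taken from the first line
```
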

Great! So, what was returned by this `read_csv()` function? We can use Python's built-in `type()` function to find out.

In [10]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In this case, what was returned was a *dataframe*, which is a data type we haven't seen before. That's because dataframes are defined by the Pandas package. We will spend the majority of our time learning about dataframes, as this is where all the magic happens. Dataframes not only store data, but they also have *methods* that allow us to manipulate the data easily. 
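To make the difference between attributes and methods concrete, here is a minimal sketch on a small hand-built dataframe. The column names and values are invented for illustration; `.shape` and `.columns` are attributes, while `.sort_values()` is one example of a method.

```python
import pandas as pd

# A hand-built dataframe; the values are invented for illustration.
toy_df = pd.DataFrame({"coin": ["penny", "dime", "nickel"], "heads": [7, 3, 5]})

# Attributes describe the dataframe and are accessed without parentheses.
print(toy_df.shape)          # (3, 2): three rows, two columns
print(list(toy_df.columns))  # ['coin', 'heads']

# Methods do something with the data and are called with parentheses.
sorted_df = toy_df.sort_values("heads")
print(sorted_df["coin"].tolist())  # ['dime', 'nickel', 'penny']
```
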

### Now you try! 

Create a new dataframe, called `us_df`, that stores data from the *national-level* NY Times coronavirus data, located [here](https://raw.githubusercontent.com/nytimes/covid-19-data/master/us.csv).

## Part 2: Exploring Data

Now that we can read in the data, the next step is to explore it and see what it looks like. So let's learn our first dataframe method. We can get a quick glimpse of the data by using the `.head()` method, which by default returns the first five observations of the dataset along with all columns.

In [13]:
df.head()

|   | date | county | state | fips | cases | deaths |
|---|---|---|---|---|---|---|
| 0 | 2020-01-21 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 1 | 2020-01-22 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 2 | 2020-01-23 | Snohomish | Washington | 53061.0 | 1 | 0 |
| 3 | 2020-01-24 | Cook | Illinois | 17031.0 | 1 | 0 |
| 4 | 2020-01-24 | Snohomish | Washington | 53061.0 | 1 | 0 |


Here we see that there are six columns: date, county, state, fips, cases, and deaths. It also looks like the data is sorted by date, with the earliest cases at the top of the dataset.
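`.head()` also accepts an optional number of rows to return. A minimal sketch on an invented ten-row dataframe (the values are made up and only serve to show the behavior):

```python
import pandas as pd

# Ten made-up rows, used only to show how head() behaves.
toy_df = pd.DataFrame({"day": range(1, 11), "cases": range(10, 110, 10)})

first_five = toy_df.head()    # default: the first five rows
first_three = toy_df.head(3)  # pass a number to ask for a different count

print(len(first_five))              # 5
print(first_three["day"].tolist())  # [1, 2, 3]
```
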

### Now you try!

Below is some code that creates a dataframe, `pop_df`, which contains 2018 county-level population estimates from the Census Bureau. Don't worry about the code that creates this dataset right now; we'll come back to it later on. For now, run the code cell that creates the dataset, then write code that prints the first five observations of the dataset in the code cell below and run that cell.

In [35]:
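import requests

# Ask the Census API for the 2018 ACS 5-year total population table (B01003) for every county.
r = requests.get("https://api.census.gov/data/2018/acs/acs5?get=NAME,group(B01003)&for=county")
r.raise_for_status()  # stop here if the request failed
pop_data = r.json()   # a list of rows; the first row holds the column names

# Use the first row as the header, relabeling its first entry as "County_Name".
pop_df = pd.DataFrame(pop_data[1:], columns=["County_Name"] + pop_data[0][1:])

pop_df.head()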


## Part 3: Cleaning Data

It's not entirely clear what is going on with this data, so it'd be great if we could rename some variables to make them a bit clearer.

In the dataset, the variable "B01003_001E" is actually the county-level population estimate. So, let's rename that to "Population Estimate" using the `.rename()` method. To do this, we pass a *dictionary* to the optional argument `columns` (it's optional because you can technically rename observations, that is, rows, instead, but more on that later). That dictionary is formatted such that the *key* is the original variable name and the *value* is the new variable name. Multiple columns can be renamed all at once by including multiple key-value pairs within the dictionary. Let's see an example of `.rename()` in action below.

In [41]:
pop_df = pop_df.rename(columns={"B01003_001E" : "Population Estimate"})

And now let's look at the dataset again to see the result:

In [42]:
pop_df.head()

|   | County_Name | GEO_ID | Population Estimate | B01003_001M | NAME | B01003_001EA | B01003_001MA | state | county |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Autauga County, Alabama | 0500000US01001 | 55200 | -555555555 | Autauga County, Alabama |  | \*\*\*\*\* | 1 | 1 |
| 1 | Baldwin County, Alabama | 0500000US01003 | 208107 | -555555555 | Baldwin County, Alabama |  | \*\*\*\*\* | 1 | 3 |
| 2 | Barbour County, Alabama | 0500000US01005 | 25782 | -555555555 | Barbour County, Alabama |  | \*\*\*\*\* | 1 | 5 |
| 3 | Bibb County, Alabama | 0500000US01007 | 22527 | -555555555 | Bibb County, Alabama |  | \*\*\*\*\* | 1 | 7 |
| 4 | Blount County, Alabama | 0500000US01009 | 57645 | -555555555 | Blount County, Alabama |  | \*\*\*\*\* | 1 | 9 |


Nice! We successfully renamed our variable of interest from that obscure variable name to something useful. Note that in Pandas, we can include spaces in column names! That's very different from statistical packages like SAS or Stata.
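Because `.rename()` takes a dictionary, several columns can be renamed in one call. Here is a minimal sketch on an invented dataframe (the column names `a`, `b`, and `c` are hypothetical). It also shows that `.rename()` returns a new dataframe rather than changing the original, which is why the result is assigned back to a variable.

```python
import pandas as pd

# A made-up dataframe with placeholder column names.
toy_df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Keys are the old names, values are the new names; unlisted columns are untouched.
renamed_df = toy_df.rename(columns={"a": "First Column", "b": "Second Column"})

print(list(renamed_df.columns))  # ['First Column', 'Second Column', 'c']
print(list(toy_df.columns))      # ['a', 'b', 'c'] (the original is unchanged)
```
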

### Now you try!

In this dataset, the variables "state" and "county" refer to the state FIPS code and county FIPS code. Those are numerical two-digit and three-digit codes, respectively, that uniquely identify states and counties within a state. Rename those two variables to "state_fips" and "county_fips" using the `.rename()` method.

The Census data also includes some columns that we don't really need, such as "NAME", which looks like a duplicate of "County_Name". We can drop columns using the `.drop()` method. In this instance, we pass a *list* of columns we'd like to drop to the `columns` argument of the `.drop()` method. See below for an example:

In [43]:
pop_df = pop_df.drop(columns=["NAME"])
pop_df.head()

|   | County_Name | GEO_ID | Population Estimate | B01003_001M | B01003_001EA | B01003_001MA | state | county |
|---|---|---|---|---|---|---|---|---|
| 0 | Autauga County, Alabama | 0500000US01001 | 55200 | -555555555 |  | \*\*\*\*\* | 1 | 1 |
| 1 | Baldwin County, Alabama | 0500000US01003 | 208107 | -555555555 |  | \*\*\*\*\* | 1 | 3 |
| 2 | Barbour County, Alabama | 0500000US01005 | 25782 | -555555555 |  | \*\*\*\*\* | 1 | 5 |
| 3 | Bibb County, Alabama | 0500000US01007 | 22527 | -555555555 |  | \*\*\*\*\* | 1 | 7 |
| 4 | Blount County, Alabama | 0500000US01009 | 57645 | -555555555 |  | \*\*\*\*\* | 1 | 9 |
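`.drop()` returns a new dataframe and leaves the original unchanged, which is why the example above assigns the result back to `pop_df`. A minimal sketch on an invented dataframe (the column names are hypothetical) that drops more than one column in a single call:

```python
import pandas as pd

# A made-up dataframe with placeholder column names.
toy_df = pd.DataFrame({"keep_me": [1, 2], "extra_1": [3, 4], "extra_2": [5, 6]})

# Pass a list of column names; all of them are removed in a single call.
trimmed_df = toy_df.drop(columns=["extra_1", "extra_2"])

print(list(trimmed_df.columns))  # ['keep_me']
print(list(toy_df.columns))      # ['keep_me', 'extra_1', 'extra_2']
```
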


### Now you try!

Use the `.drop()` method to drop the three variables that begin with "B01003" and then print the first five observations of the new dataset using the `.head()` method.

In [None]:
pop_df = pop_df.drop(columns=["B01003_001M", "B01003_001EA", "B01003_001MA"])

In [47]:
pop_df.head()

|   | County_Name | GEO_ID | Population Estimate | state | county |
|---|---|---|---|---|---|
| 0 | Autauga County, Alabama | 0500000US01001 | 55200 | 1 | 1 |
| 1 | Baldwin County, Alabama | 0500000US01003 | 208107 | 1 | 3 |
| 2 | Barbour County, Alabama | 0500000US01005 | 25782 | 1 | 5 |
| 3 | Bibb County, Alabama | 0500000US01007 | 22527 | 1 | 7 |
| 4 | Blount County, Alabama | 0500000US01009 | 57645 | 1 | 9 |
