# Pandas API walkthrough

Let's walk through the `pandas` API we learned in the lecture video together!

This lab assumes you've watched the pre-lab videos on both Object-Oriented Python and Intro to Pandas. If you haven't watched them yet, go do that first.

----

We always need to import pandas, since it is not part of the Python standard library. Run this in a Python environment that has the pandas library installed. If you are using the Anaconda distribution of Python, you already have it.

In [1]:
import pandas as pd

## Importing Data

Next we will want to create a DataFrame. The whole point of pandas is to get your data into a DataFrame so you can use its methods to do data analysis.

The easiest way is with a convenience function such as `read_csv`, which reads a file. The file can be a path on disk or a URL accessible over the web.

You can check out the entire function signature of `read_csv` [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), but I will highlight just a few of the most important arguments.

A few notes first:

- the first positional argument is always a reference to the file, either a string or a File-like object
- all other arguments are *keyword arguments* (also called "kwargs"), which must be specified by name, NOT by position
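
To make those two notes concrete, here is a minimal sketch that reads the same data once from a path on disk and once from a file-like object. The data, the variable names, and the file name `scores.csv` are all invented for illustration; they are not part of the lab's datasets.

```python
import io
import os
import tempfile

import pandas as pd

# Hypothetical two-column CSV, invented purely for illustration.
csv_text = "name,score\nada,90\ngrace,95\n"

# 1) Pass a path on disk as the first positional argument.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "scores.csv")
    with open(csv_path, "w") as f:
        f.write(csv_text)
    from_path = pd.read_csv(csv_path)

# 2) Pass a file-like object instead; every other argument goes by keyword.
from_buffer = pd.read_csv(io.StringIO(csv_text), sep=",")

# Both calls should produce the same DataFrame.
print(from_path.equals(from_buffer))
```

Reading from `io.StringIO` is also a handy way to experiment with `read_csv` arguments without needing a file or a network connection.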

Here are some kwargs to help you import data:
- `sep` or `delimiter`: a string of the character that represents the separation between columns on the same line
- `header`: the row number (0-indexed) to use as the header. The data starts on the row after it, and all rows before it are ignored. Use `None` if the data has no header
- `names`: a list of strings of the column names to use, in order
- `dtype`: a dictionary mapping column names to data types, indicating what type each column should be parsed as
- `na_values`: a string or list of strings to interpret as `NaN` if found in a cell
- `index_col`: the column position, as an integer, to use as the index (a column name also works)
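
As a rough sketch of how these kwargs combine, the example below parses a small made-up file held in a string. The data, the column names, and the `"unknown"` missing-value marker are assumptions chosen for illustration, not one of the lab's datasets.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data with no header row.
# "unknown" stands in for a missing score.
raw = "1|alpha|2001|3.5\n2|beta|2002|unknown\n3|gamma|2003|4.25\n"

demo_df = pd.read_csv(
    io.StringIO(raw),
    sep="|",                                  # columns are separated by pipes
    header=None,                              # the first line is data, not a header
    names=["id", "label", "year", "score"],   # supply our own column names
    dtype={"year": "object"},                 # keep the year as text, not a number
    na_values=["unknown"],                    # treat "unknown" as NaN
    index_col=0,                              # use the first column ("id") as the index
)

print(demo_df)
print(demo_df.dtypes)
```

Because `"unknown"` is read as `NaN`, the `score` column can still be parsed as a float column rather than falling back to text.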

## Try it yourself!

Each of the following calls to `read_csv` results in a badly imported DataFrame. Use one or more of the kwargs explained above to fix the issue. For your convenience, a preview of each file is also shown. Study the preview, look at the result from a naive call to `read_csv`, and then use the next section of cells to fix it using keyword arguments to `read_csv`.

The four scenarios to try are:
- Wine data - fix the columns
- Iris data - should use columns `["Sepal Width", "Sepal Length", "Petal Width", "Petal Length", "Species"]`
- Hepatitis data - has no header fields, and `"?"` should be interpreted as `NaN`
- Barley full data - use the source data index (which notably starts at 1), and make the year column import as `"object"` type

-------
### Wine data 

**Preview:**
```
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9.5;6
8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4;9.9;6
```

In [12]:
wine_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv")

In [13]:
wine_df.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
1,6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...
2,8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...
3,7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...
4,7.2;0.23;0.32;8.5;0.058;47;186;0.9956;3.19;0.4...


### Fix the wine data below!

In [None]:
# add a kwarg to `read_csv` that will fix the import
wine_df_2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv")
wine_df_2.head()

-------
### Iris data

**Preview:**
```
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
```

In [20]:
iris_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
iris_df.head()

Unnamed: 0,5.1,3.5,1.4,0.2,Iris-setosa
0,4.9,3.0,1.4,0.2,Iris-setosa
1,4.7,3.2,1.3,0.2,Iris-setosa
2,4.6,3.1,1.5,0.2,Iris-setosa
3,5.0,3.6,1.4,0.2,Iris-setosa
4,5.4,3.9,1.7,0.4,Iris-setosa


### Fix the iris data below!

In [None]:
iris_df_2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
iris_df_2.head()

-------
### Hepatitis data

**Preview**:
```
2,30,2,1,2,2,2,2,1,2,2,2,2,2,1.00,85,18,4.0,?,1
2,50,1,1,2,1,2,2,1,2,2,2,2,2,0.90,135,42,3.5,?,1
2,78,1,2,2,1,2,2,2,2,2,2,2,2,0.70,96,32,4.0,?,1
2,31,1,?,1,2,2,2,2,2,2,2,2,2,0.70,46,52,4.0,80,1
2,34,1,2,2,2,2,2,2,2,2,2,2,2,1.00,?,200,4.0,?,1
```

In [16]:
hepatitis_df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data")
hepatitis_df.head()

Unnamed: 0,2,30,2.1,1,2.2,2.3,2.4,2.5,1.1,2.6,2.7,2.8,2.9,2.10,1.00,85,18,4.0,?,1.2
0,2,50,1,1,2,1,2,2,1,2,2,2,2,2,0.9,135,42,3.5,?,1
1,2,78,1,2,2,1,2,2,2,2,2,2,2,2,0.7,96,32,4.0,?,1
2,2,31,1,?,1,2,2,2,2,2,2,2,2,2,0.7,46,52,4.0,80,1
3,2,34,1,2,2,2,2,2,2,2,2,2,2,2,1.0,?,200,4.0,?,1
4,2,34,1,2,2,2,2,2,2,2,2,2,2,2,0.9,95,28,4.0,75,1


### Fix the hepatitis data below!

In [None]:
hepatitis_df_2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/hepatitis/hepatitis.data")
hepatitis_df_2.head()

-----
### Barley full data

**Preview**:
```
,yield,gen,year,site
1,47.5,Manchuria,1927,StPaul
2,45.4,Glabron,1927,StPaul
3,45,Svansota,1927,StPaul
4,43.4,Velvet,1927,StPaul
```

In [28]:
barley_full_df = pd.read_csv("https://gist.githubusercontent.com/anonymous/58f6723852c25df8f9d5bfbead633367/raw/51503c5e8ec4ea2197671fdc6188867e4e9f377c/barleyfull.csv")
barley_full_df.head()

Unnamed: 0.1,Unnamed: 0,yield,gen,year,site
0,1,47.5,Manchuria,1927,StPaul
1,2,45.4,Glabron,1927,StPaul
2,3,45.0,Svansota,1927,StPaul
3,4,43.4,Velvet,1927,StPaul
4,5,60.2,Trebi,1927,StPaul


In [30]:
barley_full_df.dtypes

Unnamed: 0      int64
yield         float64
gen            object
year            int64
site           object
dtype: object

### Fix the barley full data below!

In [None]:
barley_full_df_2 = pd.read_csv("https://gist.githubusercontent.com/anonymous/58f6723852c25df8f9d5bfbead633367/raw/51503c5e8ec4ea2197671fdc6188867e4e9f377c/barleyfull.csv")
barley_full_df_2.head()

In [None]:
barley_full_df_2.dtypes

----
For more reading on importing (and exporting!) data, see: https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html

-----

## Indexing dataframes

We saw in the lecture that there are three ways to index DataFrames:

- by column, using `df[]` notation
- by name (label), using `df.loc[]` notation
- by integer position, using `df.iloc[]` notation

In each of these cases, the square brackets `[]` are filled with something different, particular to the type of index notation being used.

Column notation always takes a column name, and returns the specified **column** as a `Series` object. For example, `df['Sepal Width']`.

`iloc` takes only index positions, and `loc` takes only names. In their basic form, each returns the specified **row** as a `Series` object. For example, `df.iloc[0]` or `df.loc['row 1']` would each pull out a single row.

The more advanced form of `iloc` and `loc` takes two coordinates, for when you want to pull out a single cell. They are always in the order `[row, column]`: for example, `df.iloc[0,1]` for row position 0, column position 1, or `df.loc['1999', 'GDP']` for row name "1999" and column name "GDP".
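
Here is a small sketch of all three notations side by side. The DataFrame and its figures are invented for illustration, and the row names are strings so that they match the `'1999'` example above.

```python
import pandas as pd

# Hypothetical figures, invented for illustration; row names are strings.
demo_df = pd.DataFrame(
    {"GDP": [9.1, 9.6, 10.3], "Population": [276, 279, 282]},
    index=["1998", "1999", "2000"],
)

gdp_column = demo_df["GDP"]                 # column "GDP" as a Series
first_row = demo_df.iloc[0]                 # row at position 0 as a Series
row_1999 = demo_df.loc["1999"]              # row named "1999" as a Series
cell_by_position = demo_df.iloc[0, 1]       # row position 0, column position 1
cell_by_name = demo_df.loc["1999", "GDP"]   # row name "1999", column name "GDP"

print(cell_by_position, cell_by_name)
```

Note that `iloc` counts positions from 0 regardless of what the row names are, while `loc` looks up whatever names the index actually holds.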

Let's try each of these using the dataframes you 'fixed' above.

### Fill-in-the-blank mode

For the first few, I've set up the skeleton for you.

In [None]:
# use your iris dataframe to pull out the Sepal Width column as a Series
iris_df_2[]

In [None]:
# use the hepatitis dataframe to pull out the value at row index 2, column index 1.
# the resulting value should be 78
hepatitis_df_2.iloc[]

In [None]:
# use the wine data to pull out the value at row name 2, column name "citric acid"
# the value should be 0.4
wine_df_2.loc[]

### Totally blank mode

For these, I haven't provided anything besides the expected dataframe variable names.

In [None]:
# use your iris dataframe to pull out the row at index position 8 as a Series
iris_df_2

In [None]:
# use the barley full data to pull out the value at row name 4, column name "gen"
# the value should be "Velvet"
# note that your barley full data row *names* should no longer be 0-indexed!
barley_full_df_2

In [None]:
# use the hepatitis data to pull out the value 96 
# you should be able to see where it is located in the head of the dataframe
hepatitis_df_2

-----

## Querying dataframes

In the lecture we also discussed how to query dataframes using boolean masks to retrieve subsets of the source dataframe that match our query. Let's practice that here.

For example, if we wanted to get all of the rows in our iris dataset that have a sepal width of 6.9, we would use syntax like this:

```
iris_df_2[iris_df_2['Sepal Width'] == 6.9]
```

Note that `iris_df_2` is mentioned twice: once in the inner expression, which constructs the boolean mask by comparing every value in the "Sepal Width" column against 6.9, and again in the outer layer, where that mask is applied to "show" only the matching records in that same dataframe.

We could split that into two pieces instead:

```
matching_rows = iris_df_2['Sepal Width'] == 6.9
iris_df_2[matching_rows]
```

Try this format if you are getting your syntax mixed up at any point in the following exercises.
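
If it helps to see what the mask itself looks like, here is a minimal sketch on a tiny made-up DataFrame (the data and column names are assumptions for illustration only). The mask is a `Series` of `True`/`False` values, one per row.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
demo_df = pd.DataFrame(
    {"width": [6.9, 5.1, 6.9, 4.7], "kind": ["a", "b", "b", "a"]}
)

# Step 1: build the boolean mask, one True/False value per row.
matching_rows = demo_df["width"] == 6.9
print(matching_rows.tolist())

# Step 2: apply the mask to keep only the rows where it is True.
subset = demo_df[matching_rows]
print(subset)
```

The subset keeps the original row names of the matching rows, so they are not renumbered from 0.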

You can also string together multiple expressions using the `&` operator for "and" / `|` operator for "or", as long as you put each individual expression in parentheses. For example:

```
iris_df_2[(iris_df_2['Sepal Width'] == 6.9) & (iris_df_2['Species'] == 'Iris-virginica')]
```

You can leave off the parentheses if you evaluate the individual expressions first:

```
first_case = iris_df_2['Sepal Width'] == 6.9
second_case = iris_df_2['Species'] == 'Iris-virginica'
iris_df_2[first_case & second_case]
```
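
The parentheses matter because Python evaluates `&` and `|` before comparisons like `==`, so without them the expression is grouped in a way you did not intend. The sketch below shows both operators on a tiny made-up DataFrame; the data and column names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
demo_df = pd.DataFrame(
    {"width": [6.9, 5.1, 6.9, 4.7], "kind": ["a", "b", "b", "a"]}
)

# "and": rows where BOTH conditions hold.
both = demo_df[(demo_df["width"] == 6.9) & (demo_df["kind"] == "b")]

# "or": rows where AT LEAST ONE condition holds.
either = demo_df[(demo_df["width"] == 6.9) | (demo_df["kind"] == "b")]

print(len(both), len(either))
```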


### Fill-in-the-?????? mode

For the first few, I've added the skeleton and you need to fill in anywhere you see question marks with the appropriate syntax.

In [None]:
# Get the records in the barley full dataset that are from the StPaul site.
# there should be 127 records
barley_full_df_2[barley_full_df_2[???????] == ??????? ]

In [None]:
# Get the records in the wine dataset where the alcohol field is over 12.0
# there should be 711 records
wine_df_2[wine_df_2[??????] > ?????]

In [None]:
# Get the records in the iris dataset 
# where the Sepal Width is over 5.0 and the Petal Width is under 1.5
# there should be 7 records
iris_df_2[(iris_df_2[?????] > ?????) & (iris_df_2[??????] < ?????)]

### Totally blank mode
For these next few, I haven't filled in anything at all besides the questions. Good luck!

In [None]:
# Get the records in the barley full dataset where the yield was over 30.0 in Duluth
# there should be 63 records

In [None]:
# Get the records in the wine dataset where the pH is greater than or equal to 3.8
# (there should be 4 records)

In [None]:
# Get the records in the hepatitis dataset where the first column (column 0) is not 2
# (there should be 32 records)

----
For more reading on indexing and querying DataFrames, see: https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html