Pandas Data Frames
==========

A DataFrame is how Pandas represents a table, and a Series is the data structure Pandas uses to represent a column. In other words, a data frame corresponds to a table and a series corresponds to a column.

What makes Pandas so attractive is its powerful interface for accessing individual records of the table, proper handling of missing values, and relational-database operations between DataFrames.

In [None]:
# LOAD IN A DATA FRAME
from matplotlib import pyplot as plt
import pandas as pd
df = pd.read_csv('../data/gapminder_gdp_europe.csv', index_col='country')

## Inspecting Data


* We can print the first or last few rows of our data frame using the `head()` and `tail()` methods.

In [None]:
print(df.head(3))

In [None]:
print(df.tail(3))

* We can also use `dtypes` and `info()` to get information about the data

In [None]:
print(df.dtypes)

In [None]:
# info() prints its report itself and returns None, so no print() is needed
df.info()

* Use `describe()` to get basic statistical info about the data

In [None]:
print(df.describe())

* Can also use describe() on data frame selections (like a single column)

In [None]:
print( df["gdpPercap_1982"].describe() )

* Use `shape` to get the row and column numbers

In [None]:
print(df.shape)

Here we can see that the data have 30 rows and 12 columns (attributes).

* Use the `len()` function to get the number of rows or the number of columns individually

In [None]:
# print number of rows of data
print(len(df))

In [None]:
# print number of columns of data
print(len(df.columns))

* Get the column names with the `columns` attribute

In [None]:
print(df.columns)

* Use a column name to get all values for that column
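
As a minimal sketch of selecting a column by name, the block below builds a small stand-in frame instead of reading the CSV file. The name `toy` is hypothetical, and the numbers are the first three rows shown in the exercise output later in this lesson.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder frame (three countries, two years).
toy = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Indexing with a single column name returns that column as a Series,
# which keeps the row labels of the data frame.
col = toy["gdpPercap_1952"]
print(type(col).__name__)
print(col)
```

The same pattern should apply to `df` once the real data are loaded.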

---
## Get information about a particular column

* Operations like mean, max, min, can be used on individual columns

In [None]:
# Mean GDP in 1967
print(df["gdpPercap_1967"].mean())
# Mean GDP in 1972
print(df["gdpPercap_1972"].mean())
# Mean GDP in 1977
print(df["gdpPercap_1977"].mean())


## Rearrange Columns

* This is difficult to do by hand or with a plain csv library
* The list method `reverse()` reverses the order of a list in place
     * E.g.   `['a', 'b', 'c']` to `['c', 'b', 'a']`

In [None]:
cols = list(df.columns)
print( cols )

cols.reverse()
print ( cols )

* Using the reversed list above, we can create a new data frame with the columns in reverse order

In [None]:
new_df = df[cols]
new_df.head(3)
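
An alternative that avoids the intermediate list is to slice the column index directly. This is a sketch on a hypothetical three-column frame named `toy`, not part of the lesson's data.

```python
import pandas as pd

# Hypothetical frame with columns in the order a, b, c.
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A step of -1 walks the column index backwards; selecting with the
# result gives a new frame and leaves the original untouched.
reversed_toy = toy[toy.columns[::-1]]

print(list(toy.columns))
print(list(reversed_toy.columns))
```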

## Selecting values

Remember that a DataFrame provides an index as a way to identify the rows of the table. Each row has a position inside the table as well as a label, which uniquely identifies its entry in the DataFrame.

To access a value at the position [ i , j ] (row, column) of a DataFrame, we have two options, depending on whether i and j refer to positions or to labels.



We can see that the first value in the first column is `1601.056136`. Remember that in programming, we begin our counting at zero, so this index would be `0,0`, not `1,1`.

### Use `DataFrame.iloc[..., ...]` to select values by their position
* Allows you to specify a location by numerical index, similar to a 2D version of character selection in strings.


In [None]:
print("\nData value in first row at first column: ", df.iloc[0, 0])

In [None]:
print("\nData value in fifth row at third column: ", df.iloc[4, 2])

### Use `DataFrame.loc[..., ...]` to select values by their (entry) label

*   Specify the location by row label and column label rather than by position.

In [None]:
print(df.loc["Albania", "gdpPercap_1952"])

In [None]:
print(df.loc['Bulgaria', "gdpPercap_1962"])
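
To make the difference between the two accessors concrete, here is a small sketch on a hypothetical stand-in frame called `toy` (values copied from the first rows shown in the exercises below). Both lookups address the same cell, once by position and once by label.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder frame.
toy = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

by_position = toy.iloc[1, 0]                     # second row, first column
by_label = toy.loc["Austria", "gdpPercap_1952"]  # same cell, addressed by labels

print(by_position, by_label)
```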

### Use `:` on its own to mean all columns or all rows.

*   Just like Python's usual slicing notation, we can print all columns or all rows with `.loc` using the `:`

In [None]:
print(df.loc["Albania",:])

* Would get the same result printing `df.iloc[0]` (without a second index).
* We can also omit the `:` and get the same result in either case.
    * e.g. `df.loc["Albania"]`

In [None]:
print(df.loc[:, "gdpPercap_1952"])

*   Would get the same result printing `df["gdpPercap_1952"]`
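*   Also get the same result printing `df.gdpPercap_1952` (attribute access works because the column name is a valid Python identifier that does not clash with an existing DataFrame attribute)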

### Select multiple columns or rows using `DataFrame.loc` and a named slice.

In [None]:
print(df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'])

In the above code, we discover that **slicing by label with `loc` is inclusive at both ends**, which differs from typical Python behavior, where a slice includes everything up to but not including the final index.

However, if we slice by integer position with `iloc`, slicing follows the usual Python behavior and excludes the final position.
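
The two slicing rules can be compared side by side. This is a sketch on a hypothetical stand-in frame named `toy`; the values are the first rows shown in the exercises below.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder frame.
toy = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Positional slice: stops before position 2, so rows 0 and 1 are returned.
by_position = toy.iloc[0:2]

# Label slice: the end label "Austria" is included.
by_label = toy.loc["Albania":"Austria"]

print(by_position.equals(by_label))
```

Both slices return the rows for Albania and Austria, even though the positional slice names an end point one past the last row it returns.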

## Result of slicing can be used in further operations.

In [None]:
print(df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].max())

In [None]:
print(df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972'].min())

*   Usually don't just print a slice.
*   All the statistical operators that work on entire data frames work the same way on slices.

## Create data frame from selections

We can create a new data frame from a selection of an existing one.

In [None]:
# Use a subset of data to keep output readable.
subsetdf = df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']
print('Subset of data:\n', subsetdf)

### Create DataFrame using query

* By indexing a data frame with a condition on one of its own columns, we can create a new data frame containing only the rows where the condition holds

In [None]:
subset10kdf = subsetdf[subsetdf["gdpPercap_1962"] >= 10000]
# Note: this keeps whole rows; subsetdf[subsetdf >= 10000] would instead mask individual values
#print("Type: ", type(subset10kdf))
print(subset10kdf)

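*   A condition on a single column, as above, keeps only the rows where it is true.
*   Applying an element-wise mask to the whole frame (next section) instead gives the value where the mask is true, and NaN (Not a Number) where it is false.
*   This is useful because NaNs are ignored by operations like max, min, average, etc.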

## Filter a DataFrame using a Boolean mask

* A frame full of Booleans is sometimes called a *mask* because of how it can be used
* Comparison is applied element by element
* Returns a similarly-shaped data frame of `True` and `False`

In [None]:
mask10k = subsetdf >= 10000
print( mask10k )

* We can use masks to filter an entire dataframe with a single query
    * More concise than writing a separate condition for each column

In [None]:
subset = subsetdf[mask10k]
print(subset)
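
The sketch below contrasts the two kinds of selection on a hypothetical stand-in frame named `toy` (values from the first rows shown in the exercises below). It also shows that statistics computed on the masked frame skip the NaN entries.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder frame.
toy = pd.DataFrame(
    {
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
        "gdpPercap_1962": [2312.888958, 10750.721110, 10991.206760],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Row filter: keeps whole rows where the 1962 value passes the test.
rows_only = toy[toy["gdpPercap_1962"] >= 10000]

# Element-wise mask: keeps the shape, replaces failing values with NaN.
masked = toy[toy >= 10000]

print(rows_only)
print(masked)

# count() and mean() skip NaN, so only the surviving values contribute.
print(masked.count())
print(masked["gdpPercap_1962"].mean())
```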

## Quiz

* Create three data frames and get the size of each one.
    1. Countries with a gdp per capita above 10000 in 1952
    1. Countries with a gdp per capita above 10000 in 1962
    1. Countries with a gdp per capita above 10000 in 1972



In [None]:
df_1 = df[df["gdpPercap_1952"] > 10000]
df_1.shape

In [None]:
df_2 = df[df["gdpPercap_1962"] > 10000]
df_2.shape

In [None]:
df_3 = df[df["gdpPercap_1972"] > 10000]
df_3.shape

## Create new columns

* We can easily create new columns in the same way we would add a key and value to a dictionary

In [None]:
# Create a new column diff_07_52 that is the difference between gdp per capita from 1952 to 2007
df["diff_07_52"] = df["gdpPercap_2007"] - df["gdpPercap_1952"]
df.head()
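
Dictionary-style assignment modifies the data frame in place. If the original should stay unchanged, one option is `DataFrame.assign`, which returns a new frame with the extra column. The sketch below uses a hypothetical stand-in frame `toy` and a hypothetical column name `diff_57_52`.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder frame.
toy = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# assign() returns a copy with the new column; toy itself is not modified.
with_diff = toy.assign(diff_57_52=toy["gdpPercap_1957"] - toy["gdpPercap_1952"])

print(list(toy.columns))
print(list(with_diff.columns))
```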

# Exercises

> ## Selection of Individual Values
>
> Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
>
> ~~~
> import pandas
>
> df = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
> ~~~
>
> Write an expression to find the Per Capita GDP of Serbia in 2007.

>> ### Solution:
>>The selection can be done by using the labels for both the row (“Serbia”) and the column (“gdpPercap_2007”):

>>`print(df.loc['Serbia', 'gdpPercap_2007'])`

>>The output is

>>`9786.534714`

> ## Extent of Slicing
>
> 1.  Do the two statements below produce the same output?
> 2.  Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?
>
> ~~~
> print(df.iloc[0:2, 0:2])
> print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
> ~~~
> 
>>### Solution
>> No, they do not produce the same output! The output of the first statement is:

>>        gdpPercap_1952  gdpPercap_1957
>>country                                
>>Albania     1601.056136     1942.284244
>>Austria     6137.076492     8842.598030

>>The second statement gives:
~~~
>>           gdpPercap_1952  gdpPercap_1957  gdpPercap_1962
>>country                                                
>>Albania     1601.056136     1942.284244     2312.888958
>>Austria     6137.076492     8842.598030    10750.721110
>>Belgium     8343.105127     9714.960623    10991.206760
~~~
>>Clearly, the second statement produces an additional row and an additional column compared to the first statement. What conclusion can we draw? A numerical slice, `0:2`, omits the final index (i.e. index 2) in the range provided, while a named slice, `'gdpPercap_1952':'gdpPercap_1962'`, includes the final element.

> ## Reconstructing Data
>
> Explain what each line in the following short program does:
> what is in `first`, `second`, etc.?
>
> ~~~
> first = pandas.read_csv('data/gapminder_all.csv', index_col='country')
> second = first[first['continent'] == 'Americas']
> third = second.drop('Puerto Rico')
> fourth = third.drop('continent', axis = 1)
> fourth.to_csv('result.csv')
> ~~~

>>### Solution
>>Let’s go through this piece of code line by line.
>>
>>`first = pandas.read_csv('data/gapminder_all.csv', index_col='country')`
>>
>>This line loads the dataset containing the GDP data from all countries into a dataframe called first. The index_col='country' parameter selects which column to use as the row labels in the dataframe.
>>
>>`second = first[first['continent'] == 'Americas']`
>>
>>This line makes a selection: only those rows of first for which the ‘continent’ column matches ‘Americas’ are extracted. Notice how the Boolean expression inside the brackets, first['continent'] == 'Americas', is used to select only those rows where the expression is true. Try printing this expression! Can you print also its individual True/False elements? (hint: first assign the expression to a variable)
>>
>>`third = second.drop('Puerto Rico')`
>>
>>As the syntax suggests, this line drops the row from second where the label is ‘Puerto Rico’. The resulting dataframe third has one row less than the original dataframe second.
>>
>>`fourth = third.drop('continent', axis = 1)`
>>
>>Again we apply the drop function, but in this case we are dropping not a row but a whole column. To accomplish this, we also need to specify the axis parameter: `axis = 1` tells `drop` to look for the label among the columns, whereas the default `axis = 0` looks among the row labels.
>>
>>`fourth.to_csv('result.csv')`
>>
>>The final step is to write the data that we have been working on to a csv file. Pandas makes this easy with the to_csv() function. The only required argument to the function is the filename. Note that the file will be written in the directory from which you started the Jupyter or Python session.
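
The row and column forms of `drop` can be tried without the data file. In this sketch the frame contents are hypothetical placeholder values chosen only to show the shape of each step; the variable names mirror the program above, and `same` is an extra hypothetical name.

```python
import pandas as pd

# Hypothetical rows and numbers, used only for illustration.
first = pd.DataFrame(
    {
        "continent": ["Americas", "Americas", "Europe"],
        "gdpPercap_2007": [100.0, 200.0, 300.0],
    },
    index=pd.Index(["Canada", "Puerto Rico", "France"], name="country"),
)

second = first[first["continent"] == "Americas"]  # keep rows in the Americas
third = second.drop("Puerto Rico")                # default axis=0: drop a row label
fourth = third.drop("continent", axis=1)          # axis=1: drop a column label
same = third.drop(columns="continent")            # keyword form of the same operation

print(fourth)
```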

> ## Selecting Indices
>
> Explain in simple terms what `idxmin` and `idxmax` do in the short program below.
> When would you use these methods?
>
> ~~~
> df = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
> print(df.idxmin())
> print(df.idxmax())
> ~~~
>> ### Solution
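>> For each column in `df`, `idxmin` returns the index label of the row holding that column's minimum; `idxmax` does the same for each column's maximum. Use these methods when you want to know where an extreme value occurs (here, in which country) rather than the value itself.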
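
A short sketch of both methods on a hypothetical stand-in frame named `toy` (values from the first rows shown in the slicing exercise above):

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder frame.
toy = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

lowest = toy.idxmin()   # Series: column name -> label of the row with the minimum
highest = toy.idxmax()  # Series: column name -> label of the row with the maximum

print(lowest)
print(highest)
```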

> ## Practice with Selection.
>
> Assume Pandas has been imported and the Gapminder GDP data for Europe has been loaded.
> Write an expression to select each of the following:
>
> 1.  GDP per capita for all countries in 1982.
> 2.  GDP per capita for Denmark for all years.
> 3.  GDP per capita for all countries for years *after* 1985.
> 4.  GDP per capita for each country in 2007 as a multiple of 
>     GDP per capita for that country in 1952.

>>### Solution
>>1: `df['gdpPercap_1982']`
>>
>>2: `df.loc['Denmark',:]`
>>
>>3: `df.loc[:,'gdpPercap_1985':]`
>>
>>4: `df['gdpPercap_2007']/df['gdpPercap_1952']`
>>

> ## Interpretation
>
> Poland's borders have been stable since 1945, but changed several times in the years before then. How would you handle this if you were creating a table of GDP per capita for Poland for the entire Twentieth Century?

---
## Keypoints:
 - "Use `DataFrame.iloc[..., ...]` to select values by index location."
 - "Use `:` on its own to mean all columns or all rows."
 - "Select multiple columns or rows using `DataFrame.loc` and a named slice."
 - "Result of slicing can be used in further operations."
 - "Use comparisons to select data based on value."
 - "Select values or NaN using a Boolean mask."