# Working With Pandas Data Frames
---

In [30]:
# Import the pandas library and read in the CSV of the gapminder data for Europe as a data frame.
# Be sure to use the "country" series as the index column.
import pandas as pd
df = pd.read_csv('../data/gapminder_gdp_europe.csv', index_col='country')

* print the data frame to view it

## Inspecting Data
We've already seen how to get information about a data frame using the `.info()` and `.describe()` methods, but there are many other ways to inspect and view a data frame.

* We can also use `.describe()` on a selection from a data frame, such as a single column

In [2]:
# Use the .describe() method on the gdpPercap_1982 column
print( df["gdpPercap_1982"].describe() )

count       30.000000
mean     15617.896551
std       6453.234827
min       3630.880722
25%      11449.870115
50%      15322.824720
75%      20901.729730
max      28397.715120
Name: gdpPercap_1982, dtype: float64


* We can print the first or last few rows of our data frame using the `head()` and `tail()` methods.

In [3]:
print(df.head(3))

         gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                   
Albania     1601.056136     1942.284244     2312.888958     2760.196931   
Austria     6137.076492     8842.598030    10750.721110    12834.602400   
Belgium     8343.105127     9714.960623    10991.206760    13149.041190   

         gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                   
Albania     3313.422188      3533.00391     3630.880722     3738.932735   
Austria    16661.625600     19749.42230    21597.083620    23687.826070   
Belgium    16672.143560     19117.97448    20979.845890    22525.563080   

         gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                  
Albania     2497.437901     3193.054604     4604.211737     5937.029526  
Austria    27042.018680    29095.920660    32417.607690    36126.492700  
Belgium    25575.570690    27561.196630    30485.883750    33692.605080  

In [4]:
print(df.tail(3))

                gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
country                                                          
Switzerland       14734.232750    17909.489730    20431.092700   
Turkey             1969.100980     2218.754257     2322.869908   
United Kingdom     9979.508487    11283.177950    12477.177070   

                gdpPercap_1967  gdpPercap_1972  gdpPercap_1977  \
country                                                          
Switzerland       22966.144320     27195.11304    26982.290520   
Turkey             2826.356387      3450.69638     4269.122326   
United Kingdom    14142.850890     15895.11641    17428.748460   

                gdpPercap_1982  gdpPercap_1987  gdpPercap_1992  \
country                                                          
Switzerland       28397.715120    30281.704590    31871.530300   
Turkey             4241.356344     5089.043686     5678.348271   
United Kingdom    18232.424520    21664.787670    22705.092540   

       

In [5]:
print(df.dtypes)

gdpPercap_1952    float64
gdpPercap_1957    float64
gdpPercap_1962    float64
gdpPercap_1967    float64
gdpPercap_1972    float64
gdpPercap_1977    float64
gdpPercap_1982    float64
gdpPercap_1987    float64
gdpPercap_1992    float64
gdpPercap_1997    float64
gdpPercap_2002    float64
gdpPercap_2007    float64
dtype: object


* Use `shape` to get the number of rows and columns

In [6]:
print(df.shape)

(30, 12)


Here we can see that the data frame has 30 rows of data and 12 columns (attributes).

* Use the `len()` function to get the number of rows or the number of columns individually

In [7]:
# print number of rows of data
print(len(df))

30


In [8]:
# print number of columns of data
print(len(df.columns))

12


* Use a column name to get all values for that column

In [31]:
# Print out only the gdpPercap_1962 column from the data frame
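
As a minimal sketch of column selection, the example here builds a tiny stand-in frame rather than reading the CSV. The name `df_small` is an assumption made for illustration; its values are copied from the `head(3)` output shown earlier.

```python
import pandas as pd

# Hypothetical toy frame: three countries and three years taken from the head(3) output.
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
        "gdpPercap_1962": [2312.888958, 10750.721110, 10991.206760],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Square brackets with a column name return that column as a Series,
# still labelled by the country index.
gdp_1962 = df_small["gdpPercap_1962"]
print(gdp_1962)
```

The same square-bracket pattern should work on the full `df` once it has been loaded.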


---
## EXERCISE:
1. How many countries are there in the gapminder_all.csv file?

---

---
## EXERCISE:
1. What is the last country listed in the gapminder_all file?

---

---
## Get information about a particular column

* Operations like mean, max, min, can be used on individual columns

In [32]:
# Print the mean GDP in 1967 below

# Print the Mean GDP in 1972 below

# Print the Mean GDP in 1977 below
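
A short sketch of column statistics on a stand-in frame (the name `df_small` is an assumption; the values come from the `head(3)` output shown earlier, so the results describe three countries only, not all of Europe):

```python
import pandas as pd

# Hypothetical toy frame with two of the year columns for three countries.
df_small = pd.DataFrame(
    {
        "gdpPercap_1967": [2760.196931, 12834.602400, 13149.041190],
        "gdpPercap_1972": [3313.422188, 16661.625600, 16672.143560],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Select one column, then call a statistic on the resulting Series.
mean_1967 = df_small["gdpPercap_1967"].mean()
max_1972 = df_small["gdpPercap_1972"].max()
min_1972 = df_small["gdpPercap_1972"].min()

print("Mean 1967:", mean_1967)
print("Max 1972:", max_1972)
print("Min 1972:", min_1972)
```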


---
## EXERCISE:
1. What is the average (mean) GDP value for all countries in 1992?
2. What about the max value for all countries in 1952?


---

## Rearrange Columns

* Difficult to do using a csv library or by hand
* The list method `.reverse()` reverses the ordering of a list in place
     * E.g.   `['a', 'b', 'c']` to `['c', 'b', 'a']`

In [34]:
# Use Python's list() function to get the data frame columns as a list
cols = list(df.columns)
print(cols)

# Use the .reverse() method to reverse the ordering of the columns in place
cols.reverse()
print(cols)

['gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962', 'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982', 'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002', 'gdpPercap_2007']
['gdpPercap_2007', 'gdpPercap_2002', 'gdpPercap_1997', 'gdpPercap_1992', 'gdpPercap_1987', 'gdpPercap_1982', 'gdpPercap_1977', 'gdpPercap_1972', 'gdpPercap_1967', 'gdpPercap_1962', 'gdpPercap_1957', 'gdpPercap_1952']

* Using the now-reversed list of column names, we can create a new data frame with the columns in reverse order

In [12]:
# Select the columns in their new order to build a new data frame
new_df = df[cols]
new_df.head(3)

| country | gdpPercap_2007 | gdpPercap_2002 | gdpPercap_1997 | gdpPercap_1992 | gdpPercap_1987 | gdpPercap_1982 | gdpPercap_1977 | gdpPercap_1972 | gdpPercap_1967 | gdpPercap_1962 | gdpPercap_1957 | gdpPercap_1952 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 5937.029526 | 4604.211737 | 3193.054604 | 2497.437901 | 3738.932735 | 3630.880722 | 3533.00391 | 3313.422188 | 2760.196931 | 2312.888958 | 1942.284244 | 1601.056136 |
| Austria | 36126.4927 | 32417.60769 | 29095.92066 | 27042.01868 | 23687.82607 | 21597.08362 | 19749.4223 | 16661.6256 | 12834.6024 | 10750.72111 | 8842.59803 | 6137.076492 |
| Belgium | 33692.60508 | 30485.88375 | 27561.19663 | 25575.57069 | 22525.56308 | 20979.84589 | 19117.97448 | 16672.14356 | 13149.04119 | 10991.20676 | 9714.960623 | 8343.105127 |


## Transposing tables

In many cases we may need to transpose the columns and rows of a table. Pandas allows us to do this easily with the `.T` attribute.


In [13]:
# Print first three rows of the data frame

# Transpose the dataframe and print the first three rows
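
A minimal sketch of transposing, using a small stand-in frame (`df_small` is an assumed name; the values come from the `head(3)` output shown earlier). Note that `.T` is an attribute, so it takes no parentheses.

```python
import pandas as pd

# Hypothetical toy frame: two countries (rows) by three years (columns).
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492],
        "gdpPercap_1957": [1942.284244, 8842.598030],
        "gdpPercap_1962": [2312.888958, 10750.721110],
    },
    index=pd.Index(["Albania", "Austria"], name="country"),
)

print("Data Frame:\n", df_small)

# After transposing, the years become the rows and the countries the columns.
df_t = df_small.T
print("\nTransposed Data Frame:\n", df_t)
print("\nShapes:", df_small.shape, "->", df_t.shape)
```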


Data Frame:
          gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                   
Albania     1601.056136     1942.284244     2312.888958     2760.196931   
Austria     6137.076492     8842.598030    10750.721110    12834.602400   
Belgium     8343.105127     9714.960623    10991.206760    13149.041190   

         gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                   
Albania     3313.422188      3533.00391     3630.880722     3738.932735   
Austria    16661.625600     19749.42230    21597.083620    23687.826070   
Belgium    16672.143560     19117.97448    20979.845890    22525.563080   

         gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                  
Albania     2497.437901     3193.054604     4604.211737     5937.029526  
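Austria    27042.018680    29095.920660    32417.607690    36126.492700  
Belgium    25575.570690    27561.196630    30485.883750    33692.605080  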

---
## EXERCISE:
1. Read in a new data frame for the gapminder_gdp_americas.csv file
2. Print the last three columns of the data frame

---

## Selecting values

A data frame provides an index as a way to identify the rows of the table. Each row has a position inside the table as well as a label, which uniquely identifies its entry in the data frame.

To access the value at position [i, j] (row, column) of a data frame, we have two options, depending on whether i and j are positions or labels.

### Use `DataFrame.iloc[..., ...]` to select values by their position
* Allows you to specify a location by numerical position, like a 2D version of selecting characters in a string.


In [14]:
print("\nData value in first row at first column: ", df.iloc[0, 0])


Data value in first row at first column:  1601.056136


In [15]:
print("\nData value in fifth row at third column: ", df.iloc[4, 2])


Data value in fifth row at third column:  4254.337839


### Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

*   Specify the location by row and column label; a number only works here if it is itself a label in the index.

In [16]:
# Print the value of Albania's GDP per capita in 1952


1601.056136


In [17]:
# Print the value of Bulgaria's GDP per capita in 1962


4254.337839
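
To make the difference between position-based and label-based access concrete, here is a small sketch on a stand-in frame (`df_small` is an assumed name; values are taken from the `head(3)` output shown earlier). Both lookups address the same cell.

```python
import pandas as pd

# Hypothetical toy frame: three countries and three years.
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
        "gdpPercap_1962": [2312.888958, 10750.721110, 10991.206760],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# By position: second row (1), third column (2). Counting starts at 0.
by_position = df_small.iloc[1, 2]

# By label: the row named "Austria", the column named "gdpPercap_1962".
by_label = df_small.loc["Austria", "gdpPercap_1962"]

print(by_position, by_label)
```

Label-based access tends to be easier to read and keeps working if rows are reordered, while position-based access is convenient when the labels are unknown.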


---
## EXERCISE
~~~
import pandas
df = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
~~~

1. Find the Per Capita GDPs for Serbia (Serbia is in the Europe CSV data file).
1. Find the Per Capita GDP for Serbia in 2007.

---
### Use `:` on its own to mean all columns or all rows.

*   Just like Python's usual slicing notation, we can print all columns or all rows with `.loc` using the `:`

In [36]:
# Print the GDP per capita of all years for Albania


* Would get the same result printing `df.iloc[0]` (without a second index).
* We can also omit the `:` and get the same result in either case.
    * e.g. `df.loc["Albania"]`

In [35]:
# Print the GDP per capita for all countries in 1952


*   Would get the same result printing `df["gdpPercap_1952"]`
*   Also get the same result printing `df.gdpPercap_1952` (since it's a column name)
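
The equivalences listed in this section can be checked on a small stand-in frame. This is only a sketch: `df_small` is an assumed name and its values come from the `head(3)` output shown earlier.

```python
import pandas as pd

# Hypothetical toy frame: three countries and two years.
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# All columns for one row: three spellings of the same selection.
row_with_colon = df_small.loc["Albania", :]
row_without_colon = df_small.loc["Albania"]
row_by_position = df_small.iloc[0]

# All rows for one column: three spellings of the same selection.
col_with_colon = df_small.loc[:, "gdpPercap_1952"]
col_by_name = df_small["gdpPercap_1952"]
col_by_attribute = df_small.gdpPercap_1952

print(row_with_colon.equals(row_without_colon), row_with_colon.equals(row_by_position))
print(col_with_colon.equals(col_by_name), col_with_colon.equals(col_by_attribute))
```

Attribute access such as `df_small.gdpPercap_1952` only works when the column name is a valid Python identifier, so the bracket form is the safer habit.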

---
## EXERCISE:
1. Print out GDP per capita for all countries in 1972



---
### We can also use the `:` to select whole sections of a table 
* Just as we would select a section of a normal Python list, we can select a section of a data frame.

In [37]:
# Print a selection of all countries in the table from Italy to Poland from years 1962 to 1972


Note that in Pandas **slicing using indexes is inclusive at both ends**, which differs from typical python behavior where slicing indicates everything up to but not including the final index.
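
The inclusive behavior of label slices can be seen in a small sketch. The frame `df_part` is an assumed stand-in whose values are copied from the "Subset of data" output elsewhere in this lesson.

```python
import pandas as pd

# Hypothetical toy frame: five countries and three years.
df_part = pd.DataFrame(
    {
        "gdpPercap_1962": [8243.582340, 4649.593785, 12790.849560, 13450.401510, 5338.752143],
        "gdpPercap_1967": [10022.401310, 5907.850937, 15363.251360, 16361.876470, 6557.152776],
        "gdpPercap_1972": [12269.273780, 7778.414017, 18794.745670, 18965.055510, 8006.506993],
    },
    index=pd.Index(["Italy", "Montenegro", "Netherlands", "Norway", "Poland"], name="country"),
)

# Label slices include BOTH endpoints: Norway and gdpPercap_1967 are in the result.
middle = df_part.loc["Montenegro":"Norway", "gdpPercap_1962":"gdpPercap_1967"]
print(middle)

# For comparison, a plain Python list slice stops before the end position.
names = list(df_part.index)
print(names[1:3])
```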

### Select multiple columns or rows using `DataFrame.iloc` and a positional slice.
* We can also make selection from a data frame using the index location of the row or column
    * Remember that in programming languages, we start counting at 0
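
As a sketch of selecting by position, the example here uses a small stand-in frame (`df_small` is an assumed name; values come from the `head(3)` output shown earlier). With `iloc`, a slice such as `0:2` covers positions 0 and 1 and stops before 2.

```python
import pandas as pd

# Hypothetical toy frame: three countries and three years.
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
        "gdpPercap_1962": [2312.888958, 10750.721110, 10991.206760],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# A single cell: first row, first column.
first_value = df_small.iloc[0, 0]

# First two columns of the first row (positions 0 and 1).
first_row_two_cols = df_small.iloc[0, 0:2]

# First two rows and first two columns.
first_two = df_small.iloc[0:2, 0:2]

# The label-based slice that selects the same block names its last row and column.
same_block = df_small.loc["Albania":"Austria", "gdpPercap_1952":"gdpPercap_1957"]

print(first_value)
print(first_row_two_cols)
print(first_two.equals(same_block))
```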

In [39]:
#Print the first row of the data frame using the .head() function
print("First row of data frame:\n",  )

# Use iloc to print the value in the first row of the first column
print("\n\nValue in the first row of the first column:\n",  )

# Use iloc to print the values of the first two columns in the first row
print("Values in the first two columns in the first row:\n",  )

First row of data frame:



Value in the first row of the first column:

Values in the first two columns in the first row:



* **Note that unlike slicing by column or row names, slicing by position with `iloc` excludes the end position**

---
## EXERCISE:
1. Print out all values from Hungary through Montenegro for the years 1977 through 1997

---
## EXERCISE:
1.  Do the two statements below produce the same output?
    ~~~
    print(df.iloc[0:2, 0:2])
    print(df.loc['Albania':'Belgium', 'gdpPercap_1952':'gdpPercap_1962'])
    ~~~

1.  Based on this, what rule governs what is included (or not) in numerical slices and named slices in Pandas?


---
### Slicing individual rows and columns
* Instead of creating slices of *this* to *that* using the `:`, we can also slice using individual rows and columns by placing names or indexes in brackets `[]`.

In [40]:
# Print out the GDP per capita of only Italy, Austria, and the United Kingdom in the years 2007 and 1957
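
A minimal sketch of selecting individual rows and columns with lists. The frame `df_small` is an assumed stand-in built from values in the `head(3)` output shown earlier, so it uses Albania, Austria, and Belgium rather than the countries named in the comment.

```python
import pandas as pd

# Hypothetical toy frame: three countries and three years.
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_1957": [1942.284244, 8842.598030, 9714.960623],
        "gdpPercap_2007": [5937.029526, 36126.492700, 33692.605080],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Lists of labels pick out exactly those rows and columns, in the order given.
by_label = df_small.loc[["Belgium", "Albania"], ["gdpPercap_2007", "gdpPercap_1957"]]

# The same selection by position: rows 2 and 0, columns 2 and 1.
by_position = df_small.iloc[[2, 0], [2, 1]]

print(by_label)
print(by_label.equals(by_position))
```

Unlike a slice, a list does not need to be contiguous and can reorder the result.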


---
## EXERCISE:
1. Using the index locations, `print` out the first, third, and eighth columns for the sixteenth through nineteenth rows.

---
## Result of slicing can be used in further operations.

In [23]:
# Print out the max value (.max()) of ALL countries from Italy to Poland for 1962 to 1972


gdpPercap_1962    13450.40151
gdpPercap_1967    16361.87647
gdpPercap_1972    18965.05551
dtype: float64


In [41]:
# Print out the min value (.min()) of ALL countries from Italy to Poland for 1962 to 1972


*   Usually don't just print a slice.
*   All the statistical operators that work on entire data frames work the same way on slices.
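
As a sketch of chaining a statistic onto a slice, the example here uses an assumed stand-in frame `df_part` whose values are copied from the "Subset of data" output elsewhere in this lesson. On these rows, `.max()` reproduces the maxima printed for the Italy-to-Poland selection.

```python
import pandas as pd

# Hypothetical toy frame: five countries and three years.
df_part = pd.DataFrame(
    {
        "gdpPercap_1962": [8243.582340, 4649.593785, 12790.849560, 13450.401510, 5338.752143],
        "gdpPercap_1967": [10022.401310, 5907.850937, 15363.251360, 16361.876470, 6557.152776],
        "gdpPercap_1972": [12269.273780, 7778.414017, 18794.745670, 18965.055510, 8006.506993],
    },
    index=pd.Index(["Italy", "Montenegro", "Netherlands", "Norway", "Poland"], name="country"),
)

# Slice first, then apply the statistic; the result has one value per column.
selection = df_part.loc["Italy":"Poland", "gdpPercap_1962":"gdpPercap_1972"]
slice_max = selection.max()
slice_min = selection.min()

print(slice_max)
print(slice_min)
```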

## Create data frame from selections

* We can create a new data frame by making a selection from an existing data frame and assigning it to a variable

In [42]:
# Create a selection of ALL countries from Italy to Poland for 1962 to 1972 and 
#  assign the selection to a variable named "subset_df"
subset_df = df.loc['Italy':'Poland', 'gdpPercap_1962':'gdpPercap_1972']

print('Subset of data:\n', subset_df)

Subset of data:
              gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Italy           8243.582340    10022.401310    12269.273780
Montenegro      4649.593785     5907.850937     7778.414017
Netherlands    12790.849560    15363.251360    18794.745670
Norway         13450.401510    16361.876470    18965.055510
Poland          5338.752143     6557.152776     8006.506993


## Create DataFrame using query
* We can compare the values in a data frame against a condition to create new selections
* By passing the resulting Boolean series back to the data frame in square brackets, we create a new data frame containing only the rows where the condition is true

In [52]:
# Create a query for the "gdpPercap_1962" series in our subset_df data frame for all values greater than 10000
query_10k = subset_df["gdpPercap_1962"] > 10000

# Create a new data frame called subset_10k_df by passing that query to the subset_df data frame
subset_10k_df = subset_df[query_10k]

print(query_10k)
print(subset_10k_df)
print(subset_10k_df.shape)

country
Italy          False
Montenegro     False
Netherlands     True
Norway          True
Poland         False
Name: gdpPercap_1962, dtype: bool
             gdpPercap_1962  gdpPercap_1967  gdpPercap_1972
country                                                    
Netherlands     12790.84956     15363.25136     18794.74567
Norway          13450.40151     16361.87647     18965.05551
(2, 3)

---
## EXERCISE:

* Create three data frames and get the size of each one.
    1. Countries with a gdp per capita in 1952 above 10000
    1. Countries with a gdp per capita in 1962 above 10000
    1. Countries with a gdp per capita in 1972 above 10000

---

## Filter a DataFrame using a Boolean mask

* A frame full of Booleans is sometimes called a *mask* because of how it can be used
* Comparison is applied element by element
* Returns a similarly-shaped data frame of `True` and `False`

In [None]:
# Create a full data frame mask for subset_df with a query for all values in all years greater than 10000
mask_10k = subset_df > 10000

print( mask_10k )

* We can use masks to filter an entire dataframe with a single query
    * More efficient than using a single query on multiple columns

In [None]:
# Pass the mask query to subset_df to create a new data frame mask_subset
mask_subset = subset_df[mask_10k]

print(mask_subset)
print("Shape: ", mask_subset.shape)

*   Returns the value where the mask is true, and NaN (Not a Number) where it is false.
*   Useful because NaNs are ignored by operations like max, min, average, etc.


* If we wanted to remove all rows with a NaN value in any column we could use the `.dropna()` function
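
The masking steps described in this section can be sketched on a stand-in frame. The name `df_part` is an assumption; its values are copied from the "Subset of data" output earlier in this lesson.

```python
import pandas as pd

# Hypothetical toy frame: five countries and three years.
df_part = pd.DataFrame(
    {
        "gdpPercap_1962": [8243.582340, 4649.593785, 12790.849560, 13450.401510, 5338.752143],
        "gdpPercap_1967": [10022.401310, 5907.850937, 15363.251360, 16361.876470, 6557.152776],
        "gdpPercap_1972": [12269.273780, 7778.414017, 18794.745670, 18965.055510, 8006.506993],
    },
    index=pd.Index(["Italy", "Montenegro", "Netherlands", "Norway", "Poland"], name="country"),
)

# Element-by-element comparison gives a same-shaped frame of True and False.
mask = df_part > 10000

# Indexing with the mask keeps values where it is True and puts NaN elsewhere.
masked = df_part[mask]
print(masked)

# Statistics skip NaN by default, so the minimum is taken over the kept values only.
print(masked.min())

# dropna() removes every row that contains at least one NaN.
complete_rows = masked.dropna()
print(complete_rows.shape)
```

Note that `dropna()` returns a new data frame and leaves `masked` unchanged.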

In [None]:
# Print the mask_subset data frame with every row that contains at least one NaN value removed


# Print the shape of the mask_subset data frame with every row that contains a NaN value removed


## Create new columns

* We can easily create new columns in the same way we would add a key and value to a dictionary

In [None]:
# Create a new column in the data frame called diff_07_52 that is the difference in GDP per capita between 1952 and 2007


print(df.head(1))
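
A minimal sketch of adding a column by assignment, using an assumed stand-in frame `df_small` with values from the `head(3)` output shown earlier. The subtraction is applied row by row, so each country gets its own difference.

```python
import pandas as pd

# Hypothetical toy frame: three countries with the first and last survey years.
df_small = pd.DataFrame(
    {
        "gdpPercap_1952": [1601.056136, 6137.076492, 8343.105127],
        "gdpPercap_2007": [5937.029526, 36126.492700, 33692.605080],
    },
    index=pd.Index(["Albania", "Austria", "Belgium"], name="country"),
)

# Assigning to a new column name adds the column, much like adding a key to a dictionary.
df_small["diff_07_52"] = df_small["gdpPercap_2007"] - df_small["gdpPercap_1952"]

print(df_small.head(1))
```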

---
## EtherPad

On EtherPad, explain what the following expression does:

    only_Am = df[df['continent'] == 'Americas']

___

## EXERCISE:
1. Explain in simple terms what `idxmin` and `idxmax` do in the short program below.
    ~~~
    df = pd.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
    print(df.idxmin())
    print(df.idxmax())
    ~~~

2. When would you use these methods?

---
## PRACTICE EXERCISE.
Using the Gapminder GDP data for Europe, write an expression to select each of the following:
1.  GDP per capita for all countries in 1982.
1.  GDP per capita for Denmark for all years.
1.  GDP per capita for all countries for years *after* 1985.
1.  GDP per capita for each country in 2007 as a multiple of GDP per capita for that country in 1952.
---

# -- COMMIT YOUR WORK TO GITHUB --

---
## Keypoints:
 - "Use `DataFrame.iloc[..., ...]` to select values by index location."
 - "Use `:` on its own to mean all columns or all rows."
 - "Select multiple columns or rows using `DataFrame.loc` and a named slice."
 - "Result of slicing can be used in further operations."
 - "Use comparisons to select data based on value."
 - "Select values or NaN using a Boolean mask."