Reading a file with Pandas
==
---

Pandas is a library widely used for statistics and data analysis.
* Has functions that read a file directly into your script
* Borrows many features from R's data frames

*   Read a Comma Separated Values (CSV) data file with `pandas.read_csv`.
    * Paths use the same notation as in bash (`./` refers to the current folder, `../` to the parent folder).
    * The argument is the name of the file to be read.
    * Assign the result to a variable to store the data that was read.

In [17]:
import pandas

df = pandas.read_csv('../data/gapminder_gdp_asia.csv')
print(df)

               country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0          Afghanistan      779.445314      820.853030      853.100710   
1              Bahrain     9867.084765    11635.799450    12753.275140   
2           Bangladesh      684.244172      661.637458      686.341554   
3             Cambodia      368.469286      434.038336      496.913648   
4                China      400.448611      575.987001      487.674018   
5      Hong Kong China     3054.421209     3629.076457     4692.648272   
6                India      546.565749      590.061996      658.347151   
7            Indonesia      749.681655      858.900271      849.289770   
8                 Iran     3035.326002     3290.257643     4187.329802   
9                 Iraq     4129.766056     6229.333562     8341.737815   
10              Israel     4086.522128     5385.278451     7105.630706   
11               Japan     3216.956347     4317.694365     6576.649461   
12              Jordan     1546.907807

*   The columns in a data frame are the observed variables, and the rows are the observations.
*   Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

> ## File Not Found
>
> Our lessons store their data files in a `data` sub-directory,
> which is why the path to the file is `data/gapminder_gdp_oceania.csv`.
> If you forget to include `data/`,
> or if you include it but your copy of the file is somewhere else,
> you will get a [runtime error]({{ site.github.url }}/05-error-messages/)
> that ends with a line like this:
>
> ~~~
> OSError: File b'gapminder_gdp_oceania.csv' does not exist
> ~~~
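
One way to see this error without stopping the notebook is to catch it. The following is a minimal sketch, not part of the lesson's own code: the file name `file_that_does_not_exist.csv` is hypothetical and chosen so that the read fails on purpose. It catches `OSError`, which also covers `FileNotFoundError` because that is a subclass, and prints the folder Python is running in so you can check the relative path.

```python
from pathlib import Path

import pandas

# Hypothetical file name, chosen so that the read fails on purpose.
data_file = Path('data') / 'file_that_does_not_exist.csv'

try:
    df = pandas.read_csv(data_file)
except OSError as error:
    # FileNotFoundError is a subclass of OSError, so this catches both.
    df = None
    print('Could not read', data_file)
    print('Python is running in', Path.cwd())
```

If the printed folder is not the one you expected, adjust the path you pass to `read_csv` rather than moving the data file.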

---
## EXERCISE:

Hypothetically, the data for a project you are working on is stored in a file called `microbes.csv`, which is located in a folder called `field_data`. You are doing analysis in a notebook called `analysis.ipynb` in a sibling folder called `thesis`. Your directory structure looks like this:
    ~~~
    your_home_directory
    +-- field_data/
    |   +-- microbes.csv
    +-- thesis/
        +-- analysis.ipynb
    ~~~

1. What value(s) should you pass to `read_csv` to read `microbes.csv` in `analysis.ipynb`?

## Use `DataFrame.info` to find out more about a data frame.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 13 columns):
country           2 non-null object
gdpPercap_1952    2 non-null float64
gdpPercap_1957    2 non-null float64
gdpPercap_1962    2 non-null float64
gdpPercap_1967    2 non-null float64
gdpPercap_1972    2 non-null float64
gdpPercap_1977    2 non-null float64
gdpPercap_1982    2 non-null float64
gdpPercap_1987    2 non-null float64
gdpPercap_1992    2 non-null float64
gdpPercap_1997    2 non-null float64
gdpPercap_2002    2 non-null float64
gdpPercap_2007    2 non-null float64
dtypes: float64(12), object(1)
memory usage: 288.0+ bytes


## Use `DataFrame.describe` to get summary statistics about data.

By default, `DataFrame.describe()` returns summary statistics only for the columns that contain numerical data.
All other columns are ignored.
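
A small sketch of this behaviour, assuming a made-up data frame called `toy` (its values are invented for illustration and do not come from the Gapminder files):

```python
import pandas

# Made-up values for illustration only (not taken from the Gapminder files).
toy = pandas.DataFrame({
    'country': ['A', 'B', 'C'],
    'gdp_1952': [1.0, 2.0, 3.0],
    'gdp_1957': [2.0, 4.0, 6.0],
})

summary = toy.describe()

# Only the two numeric columns appear; the text column 'country' is left out.
print(summary.columns.tolist())
print(summary.loc['mean'])
```

The result of `describe` is itself a data frame, so individual statistics can be picked out by row label, as with `summary.loc['mean']` here.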

In [7]:
print(df.describe())

       gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
count        2.000000        2.000000        2.000000        2.000000   
mean     10298.085650    11598.522455    12696.452430    14495.021790   
std        365.560078      917.644806      677.727301       43.986086   
min      10039.595640    10949.649590    12217.226860    14463.918930   
25%      10168.840645    11274.086022    12456.839645    14479.470360   
50%      10298.085650    11598.522455    12696.452430    14495.021790   
75%      10427.330655    11922.958888    12936.065215    14510.573220   
max      10556.575660    12247.395320    13175.678000    14526.124650   

       gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
count         2.00000        2.000000        2.000000        2.000000   
mean      16417.33338    17283.957605    18554.709840    20448.040160   
std         525.09198     1485.263517     1304.328377     2037.668013   
min       16046.03728    16233.717700    17632.410

---
## EXERCISE:
1. Read the data in `gapminder_gdp_americas.csv` (which should be in the same directory as `gapminder_gdp_oceania.csv`) into a variable called `americas` and display its summary statistics.

---

## The `DataFrame.columns` variable stores information about the data frame's columns.

*   Note that this is data, *not* a method.
    *   Like `math.pi`.
    *   So do not use `()` to try to call it.
*   Called a *member variable*, or just *member*.

In [9]:
print(df.columns)

Index(['country', 'gdpPercap_1952', 'gdpPercap_1957', 'gdpPercap_1962',
       'gdpPercap_1967', 'gdpPercap_1972', 'gdpPercap_1977', 'gdpPercap_1982',
       'gdpPercap_1987', 'gdpPercap_1992', 'gdpPercap_1997', 'gdpPercap_2002',
       'gdpPercap_2007'],
      dtype='object')
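
To illustrate the difference between a member and a method, here is a minimal sketch using a made-up two-column data frame called `toy` (a hypothetical name, not one of the lesson's files). Reading `toy.columns` works, while trying to call it raises a `TypeError`.

```python
import pandas

# Made-up data frame for illustration only.
toy = pandas.DataFrame({'country': ['A', 'B'], 'gdp_1952': [1.0, 2.0]})

# Member access: no parentheses.
names = list(toy.columns)
print(names)

try:
    toy.columns()        # columns is data, so it cannot be called
    called_ok = True
except TypeError as error:
    called_ok = False
    print('TypeError:', error)
```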


## Use `index_col` to specify that a column's values should be used as row headings.

*   By default, row headings are numbers (0 and 1 in this case).
*   We would really like to index by country.
*   Pass the name of the column to `read_csv` as its `index_col` parameter to do this.

In [18]:
df = pandas.read_csv('../data/gapminder_gdp_oceania.csv', index_col='country')
print(df)

             gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  gdpPercap_1967  \
country                                                                       
Australia       10039.59564     10949.64959     12217.22686     14526.12465   
New Zealand     10556.57566     12247.39532     13175.67800     14463.91893   

             gdpPercap_1972  gdpPercap_1977  gdpPercap_1982  gdpPercap_1987  \
country                                                                       
Australia       16788.62948     18334.19751     19477.00928     21888.88903   
New Zealand     16046.03728     16233.71770     17632.41040     19007.19129   

             gdpPercap_1992  gdpPercap_1997  gdpPercap_2002  gdpPercap_2007  
country                                                                      
Australia       23424.76683     26997.93657     30687.75473     34435.36744  
New Zealand     18363.32494     21050.41377     23189.80135     25185.00911  


*   This is a `DataFrame`
*   Two rows named `'Australia'` and `'New Zealand'`
*   Twelve columns, each of which has two actual 64-bit floating point values.
    *   We will talk later about null values, which are used to represent missing observations.
*   Uses 208 bytes of memory.

*   Not particularly useful with just two records,
    but very helpful when there are thousands.
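
The effect of `index_col` on the row labels and on the column count can be sketched with a tiny in-memory CSV. This is an illustration under stated assumptions: the text in `csv_text` and its numbers are made up, and `io.StringIO` stands in for a file on disk.

```python
import io

import pandas

# Two-row CSV text with made-up numbers, standing in for a file on disk.
csv_text = "country,gdp_1952,gdp_1957\nX,1.0,2.0\nY,3.0,4.0\n"

by_number = pandas.read_csv(io.StringIO(csv_text))
by_country = pandas.read_csv(io.StringIO(csv_text), index_col='country')

print(by_number.index.tolist())    # default integer row labels
print(by_country.index.tolist())   # country names used as row labels
print(by_number.shape, by_country.shape)
```

Once `country` is used as the index it is no longer counted among the columns, which is why the indexed data frame has one column fewer.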

## Writing to a CSV file
As well as the `read_csv` function for reading data from a file, Pandas can write a CSV file with the `DataFrame.to_csv` method.
    
    
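    # Plain assignment would only give the same data frame a second name;
    # .copy() makes an independent copy.
    df_copy = df.copy()

    df_copy.to_csv('../data/gapminder_gdp_oceania_copy.csv')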
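
A quick way to check that a file was written as intended is to read it back. The sketch below assumes a made-up data frame `toy` and a hypothetical file name `toy_copy.csv`, and writes to a temporary folder so that it does not touch the lesson's `data` directory.

```python
import tempfile
from pathlib import Path

import pandas

# Made-up data frame indexed by country, for illustration only.
toy = pandas.DataFrame(
    {'gdp_1952': [1.0, 3.0], 'gdp_1957': [2.0, 4.0]},
    index=pandas.Index(['X', 'Y'], name='country'),
)

with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / 'toy_copy.csv'   # hypothetical file name
    toy.to_csv(out_file)                    # the index is written as the first column
    restored = pandas.read_csv(out_file, index_col='country')

print(restored)
```

Because `to_csv` writes the index as an ordinary column, `index_col` is needed again when reading the file back in.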
    

---
## EXERCISE:
1. Read `gapminder_gdp_africa.csv` into a new data frame.
2. Write out a copy of the data frame to a new file called `gapminder_gdp_africa.csv.bak`.

---

# -- COMMIT YOUR WORK TO GITHUB --

---
## Keypoints:
 * Use the Pandas library to do statistics on tabular data.
 * Use `index_col` to specify that a column's values should be used as row headings.
 * Use `DataFrame.info` to find out more about a data frame.
 * The `DataFrame.columns` variable stores information about the data frame's columns.
 * Use `DataFrame.T` to transpose a data frame.
 * Use `DataFrame.describe` to get summary statistics about data.