# 05 Reading Tabular Data Using Pandas

# Objectives
- Import the Pandas library.
- Use Pandas to load a CSV data set.
- Get summary information from a Pandas DataFrame.
- Download online data using Pandas.

## Pandas
- Pandas is a widely used Python library for data analysis and statistics.
- Its focus is tabular data.
- It is similar to R in that it uses a structure called a dataframe.
- Dataframes can contain multiple data types.
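
To make the last point concrete, here is a minimal sketch of a dataframe whose columns hold different data types. The table, its column names, and its values are all made up for illustration; they are not part of the Gapminder data.

```python
import pandas as pd

# Hypothetical toy table: names and values are invented for this example.
people = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],   # text
    "age": [36, 45, 28],                 # whole numbers
    "height_m": [1.65, 1.70, 1.80],      # decimal numbers
})

# Each column keeps its own data type (dtype).
print(people.dtypes)
```

The exact dtype reported for the text column may differ between pandas versions, but the numeric columns should show up as an integer type and a floating point type.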

![Data Frame ](https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg)

Source: <https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg>

- Pandas can read and write many tabular formats, including CSV, Excel, JSON, and SQL tables.

![Data Processed by Pandas](https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg)

Source: <https://pandas.pydata.org/pandas-docs/stable/_images/02_io_readwrite.svg>

In [None]:
# 1. Run this cell to download the data
# 2. Open the downloaded files to get a sense of the data

# Downloads a zip file from Carpentries webpage with Gapminder data
!wget http://swcarpentry.github.io/python-novice-gapminder/files/python-novice-gapminder-data.zip
# Unzips the file
!unzip python-novice-gapminder-data.zip

- Load Pandas with `import pandas as pd`

In [None]:
# Import the pandas library
import pandas as pd

# Use `read_csv` to read the gapminder data
data = pd.read_csv('data/gapminder_gdp_oceania.csv')
print(data)

- The columns in a dataframe are the observed variables, and the rows are the observations.
- Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

## `index_col`
- Use `index_col` to specify that a column's values should be used as row identifiers.

In [None]:
data = pd.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
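
The effect of `index_col` is easiest to see side by side. The sketch below reads the same small CSV twice, once without and once with `index_col`. It uses `io.StringIO` to stand in for a file on disk, and the country names and numbers are invented placeholders, not real Gapminder values.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file; the values are illustrative only.
csv_text = """country,gdp_2000,gdp_2010
Atlantis,100.0,150.0
Wakanda,200.0,260.0
"""

# Without index_col, rows are labelled 0, 1, ... and 'country' is an ordinary column.
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col, the 'country' values become the row labels.
by_country = pd.read_csv(io.StringIO(csv_text), index_col="country")

print(default_index.index.tolist())   # [0, 1]
print(by_country.index.tolist())      # ['Atlantis', 'Wakanda']
print(by_country.columns.tolist())    # ['gdp_2000', 'gdp_2010']
```

Note that once `country` is the index, it no longer appears among the columns.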

- Use `DataFrame.info` to find out more about a dataframe.

In [None]:
# `info()` is a method of data
data.info()

*   `data` is a `DataFrame`.
*   It has two rows, labelled `'Australia'` and `'New Zealand'`.
*   It has twelve columns, each holding two non-null 64-bit floating point values.
*   It uses 208 bytes of memory.

## Attributes
- The `DataFrame.columns` attribute stores information about the dataframe's columns.
- Note that this is a variable, *not* a method.
  - It doesn't have `()`.
- Such a variable is called a *member variable*, just a *member*, or an *attribute*.

In [None]:
print(data.columns)
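
One way to check the difference between an attribute and a method is Python's built-in `callable()`. This is a small sketch on a made-up dataframe (the name `table` and its values are hypothetical), not something specific to the Gapminder data.

```python
import pandas as pd

# Made-up values, only to have something to inspect.
table = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["a", "b"])

# Attributes are stored values: read them without parentheses.
print(table.columns)
print(table.shape)                # (2, 2)

# Methods are functions attached to the object: they need () to run.
print(callable(table.info))       # True
print(callable(table.columns))    # False
```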

- Use `DataFrame.T` to transpose a dataframe.

*   Sometimes we want to treat columns as rows and vice versa.
*   When every column has the same data type, as here, transposing doesn't need to copy the data; it just changes the program's view of it.

In [None]:
print(data.T)
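
As a quick check of what transposing does, the sketch below builds a small two-row, three-column dataframe and compares it with its transpose. The labels and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical table: 2 rows (places) by 3 columns (years).
wide = pd.DataFrame(
    {"gdp_1990": [80.0, 170.0], "gdp_2000": [100.0, 200.0], "gdp_2010": [150.0, 260.0]},
    index=["Atlantis", "Wakanda"],
)

flipped = wide.T

print(wide.shape)               # (2, 3)
print(flipped.shape)            # (3, 2)

# Row labels and column labels swap places.
print(flipped.index.tolist())   # ['gdp_1990', 'gdp_2000', 'gdp_2010']
print(flipped.columns.tolist()) # ['Atlantis', 'Wakanda']
```

Transposing twice gets back to the original layout.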

## Summary Statistics
- Use `DataFrame.describe` to get summary statistics about data.

- By default, `DataFrame.describe()` summarizes only the columns that contain numerical data.
  All other columns are ignored unless you pass `include='all'`.

In [None]:
# 1. Print the summary statistics for our dataframe
print(data.describe())
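
Because every column in the Oceania file is numeric, the cell above cannot show columns being skipped. The following sketch uses a made-up dataframe with one text column to illustrate the default behavior; the names and values are hypothetical.

```python
import pandas as pd

# Invented example data: one text column and two numeric columns.
mixed = pd.DataFrame({
    "city": ["Avalon", "Camelot", "Lyonesse"],   # text, skipped by default
    "population": [1000, 2500, 4000],            # numeric
    "area_km2": [10.5, 20.0, 32.5],              # numeric
})

stats = mixed.describe()

# Only the numeric columns appear in the summary.
print(stats.columns.tolist())   # ['population', 'area_km2']
print(stats)
```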

# Exercise
1. `read_csv()` can download data directly from a webpage.
   Download a dataset called the Titanic Data Set by passing
   the following URL to `read_csv()` instead of a file path.
   Put the new dataframe in a variable called `titanic`.
2. Use `titanic.head()` to have a look at the new dataframe.

**Data URL:**
<https://github.com/pandas-dev/pandas/raw/master/doc/data/titanic.csv>
