# Pandas Library

In [1]:
# import pandas as pd
import pandas as pd

# import create_engine from sqlalchemy
from sqlalchemy import create_engine

# import web scraping functions
from urllib.request import urlretrieve, urlopen, Request

## General Info and Commands
#### Basic Commands
- These commands assume a dataframe name of `df`
- Print the 'head' or first 5 rows
    - `print(df.head())` use the '.head()' method of a dataframe
    - works without the `print()` call
- Print the 'tail' or last 5 rows
    - `print(df.tail())`
    - works without the `print()` call
- Access the 'keys' of a dataframe
    - `df.keys()`
- Combining indexing with other methods
    - useful when each key maps to its own dataframe (for example, a dict of dataframes keyed by sheet name)
    - `print(df['key1'].head())` will print the head of the dataframe associated with 'key1'
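
The commands above can be sketched on a small made-up dataframe. The column names, values, and the `frames` dict are illustrative assumptions, not part of any real data set:

```python
import pandas as pd

# Hypothetical data used only for illustration
df = pd.DataFrame({"city": list("ABCDEFG"), "temp": [10, 12, 9, 15, 11, 14, 13]})

first_rows = df.head()      # first 5 rows by default
last_rows = df.tail()       # last 5 rows by default
columns = list(df.keys())   # for a dataframe, keys() returns the column labels

# A dict of dataframes is indexed by key first, then treated like any dataframe
frames = {"key1": df, "key2": df[df["temp"] > 11]}
key1_head = frames["key1"].head()

print(first_rows)
print(columns)
```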

## Working With File Types

#### Importing Flat Files
- Such as csv and txt files with rows and cols
- `dataframe = pd.read_csv(filename[, sep][, comment][, na_values][, nrows][, header])`
    - `sep` is the pandas name for the delimiter, with default `','`
    - `comment` takes the character that starts a comment; the rest of that line is ignored (for python it's '#')
    - `na_values` takes a list of strings to identify as NA or NaN
    - `nrows` specifies an integer for the number of rows to retrieve
    - `header=None` if no header
    - view the header and first 5 lines of the dataframe with `.head()` method `dataframe.head()`
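
A minimal sketch of these arguments in use. The file contents are invented for illustration, and `io.StringIO` stands in for a real filename so the example runs without a file on disk:

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited contents with a header row
raw = (
    "id;score\n"
    "1;2.5\n"
    "2;missing\n"
    "3;4.5#inline comment\n"
    "4;5.5\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                # delimiter
    comment="#",            # drop everything after '#'
    na_values=["missing"],  # treat this string as NaN
    nrows=3,                # read only the first 3 data rows
)

# With header=None the first line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO("1;2\n3;4\n"), sep=";", header=None)

print(df.head())
print(no_header.head())
```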

#### Excel Files
- `datafile = pd.ExcelFile('filename')`
- view different sheets in the file/dataframe
    - `print(datafile.sheet_names)`
    - use `.sheet_names` attribute of this object
- extract a sheet into a dataframe
    - `dataframe = datafile.parse(sheet[, skiprows][, names][, usecols])`
        - `sheet` supply sheet name as a str or index as an int (0 indexed)
        - the following args must be in list format
            - `[arg]` if only supplying one value
        - `skiprows` supply a list of rows to skip (0 indexed)
        - `names` supply a list of names for your imported columns
        - `usecols` supply a list of columns to import (0 indexed)
- Read Excel file and store each sheet as a dataframe with sheet names as the keys to each individual dataframe
    - `df = pd.read_excel('filename', sheet_name=None)`
        - `sheet_name` accepts a sheet name or index; with `sheet_name=None` all sheets are returned as a dict of dataframes with sheet names as keys
        - can use a 'url' as the 'filename' to scrape data from the web

#### SAS and Stata Files
- SAS Files
    - `from sas7bdat import SAS7BDAT`
    - `with SAS7BDAT('filename.sas7bdat') as file:`
          `dfsas = file.to_data_frame()`
- Stata Files
    - `data = pd.read_stata('filename.dta')`

#### HDF5 Files
- HDF5 is becoming the industry standard for big data sets
- hierarchy of keys and values, where a value can itself hold further keys
- `import h5py`
  `filename = 'filename.hdf5'`
  `data = h5py.File(filename, 'r')`
- exploring data structure
    - `for key in data.keys():`
      `print(key)`
    - provides keys that can be accessed such as 'meta' for metadata
    - access its contents
        - `for key in data['meta'].keys():`
          `print(key)` returns another key in this example 'Description'
    - accessing values
        - `data['meta']['Description'][()]` (the older `.value` attribute was removed in h5py 3.0)

#### Scraping Data from the Web
- Some functionality using the 'urllib' package
    - `from urllib.request import urlretrieve, urlopen, Request`
    - import not necessary when using some of the functions below
- Import data into a dataframe using a url
    - `url = 'http://....filename.csv'`
    - `df = pd.read_csv(url, sep=';')` using the appropriate separator (delimiter)
    - `df = pd.read_excel(url, sheet_name=None)`

## Convert Dataframe to Numpy Array
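- `array = dataframe.values` (or `dataframe.to_numpy()`, which the pandas documentation recommends)
    - the array has a single dtype, so columns of different types are upcast to a common type (often `object`)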
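
As a small illustration of how the resulting dtype depends on the columns, here is a sketch with two made-up dataframes (all names and values are assumptions for the example):

```python
import pandas as pd

# Hypothetical dataframes: one all-numeric, one mixing numbers and strings
numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
mixed = pd.DataFrame({"a": [1, 2], "label": ["x", "y"]})

numeric_array = numeric.to_numpy()  # ints and floats are upcast to float64
mixed_array = mixed.to_numpy()      # no common numeric type, so dtype is object

print(numeric_array.dtype, mixed_array.dtype)
```

When the columns are all numeric the array is ready for numerical work; an `object` array usually means a column should be dropped or converted first.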

## Working with Databases
- Need to import the appropriate package
    - `from sqlalchemy import create_engine`
- Creating an engine (sqlalchemy package)
    - `engine = create_engine('sqlite:///db_name.sqlite')`
        - above syntax `'db_type:///db_name.extension'`
- Running a query using Pandas
    - `df = pd.read_sql_query("SELECT * FROM table_name", engine)`
    - `engine` is the engine to connect to (see above)
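
A runnable sketch of the query step. To avoid needing a database file, it swaps the SQLAlchemy engine for an in-memory `sqlite3` connection from the standard library, which `pd.read_sql_query` also accepts; the table contents are invented for illustration. With an engine created as above, `engine` would be passed in place of `conn`:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database with a hypothetical table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO table_name VALUES (?, ?)", [(1, "ann"), (2, "bob")])
conn.commit()

# The second argument may be a SQLAlchemy engine or, as here, a sqlite3 connection
df = pd.read_sql_query("SELECT * FROM table_name", conn)
conn.close()

print(df)
```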