<center> 
# R406: Using Python for data analysis and modelling

<br> <br> 

## <center> Pandas — interaction with data sources and IO

<br>

<center> **Andrey Vassilev**

<br> 


 

# Outline

1. Native Pandas functionality
  - reading and writing CSV files
  - reading and writing Excel files
2. Using the `pandas_datareader` module
  - accessing Eurostat data
  - accessing World Bank data
  - accessing OECD data
3. Pointers to other functionality (**Note:** references only!)
   - SQL queries
   - Stata, SAS and SPSS (via `savReaderWriter`) files

In [None]:
import numpy as np
import pandas as pd

# Reading CSV files

The basic function for reading CSV files is `read_csv()`. In its simplest form it takes only a string containing the name of the file to be imported. The file is imported as a `DataFrame`.

In [None]:
# See https://github.com/fivethirtyeight/data/tree/master/college-majors
# for a description of the data
gradstudents = pd.read_csv("grad-students.csv")

In [None]:
gradstudents.head()

Some useful parameters (see the [docs](http://pandas.pydata.org/pandas-docs/stable/io.html) for a full description):
 - The default separator is a comma. If your file uses a different delimiter, pass something like `sep = ";"`. Pandas tries to infer the separator only if you pass `sep = None`, which uses the slower Python parsing engine.
 - If your data contains a header (i.e. column names) that is not in the first row (`header = 0`, the default behaviour), pass the zero-based row number of the header, e.g. `header = 3`; the rows before it are skipped. Pass `header = None` if you know your file contains no header.
 - More generally, you can pass something like `skiprows = 2` (skips the first two rows) or `skiprows = [0,2,3]` (skips the rows with these zero-based numbers). The parameter `skipfooter = n` skips the last `n` rows.
 - The parameter `names = ["Col1", "Col2"]` sets specific column names in your `DataFrame`. If the file also has a header row, pass `header = 0` as well so that the names replace it instead of the header being read as data.
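
The parameters above can be tried out without a file on disk. The sketch below is only an illustration: the text in `raw` is made up, and `io.StringIO` stands in for a real file name.

```python
import io
import pandas as pd

# Made-up file contents: two lines of notes, then a header, then data,
# with semicolons as delimiters.
raw = """some notes, line 1
some notes, line 2
major;median
Physics;70000
History;50000
"""

# Skip the two leading lines; the next line is then used as the header.
df = pd.read_csv(io.StringIO(raw), sep=";", skiprows=2)

# Replace the header found in the file with custom column names.
df_named = pd.read_csv(io.StringIO(raw), sep=";", skiprows=2,
                       header=0, names=["Col1", "Col2"])

# Skip the header line too and treat everything else as data;
# the columns are then simply numbered 0, 1.
df_nohdr = pd.read_csv(io.StringIO(raw), sep=";", skiprows=3, header=None)

print(df)
print(df_named)
print(df_nohdr)
```

All three calls return the same two data rows; only the column labels differ.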

You can also read a file that is not in your current working directory by passing its full path:

```python
mj = pd.read_csv(r"C:\Users\User\Downloads\majors-list.csv")
```

Or you can even pass a specific URL to retrieve your CSV from the web:

In [None]:
women_stem = pd.read_csv(r"https://github.com/fivethirtyeight/data/raw/master/college-majors/women-stem.csv")
women_stem.head(3)

# Writing data to CSV files

Data is written to a CSV file using the `to_csv()` method of a dataframe.

In [None]:
gsshort = gradstudents.iloc[0:5,[1,3,5]]
print(gsshort)
gsshort.to_csv("gsshort.csv", header=False)

The `to_csv()` method can also take parameters specifying:
 - the delimiter: `sep = ";"`
 - whether to write the column names as a table header: `True` by default, pass `header = False` to omit them
 - whether to write the index: `True` by default, pass `index = False` to omit it

It can also take a path outside the current working directory.
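
As a quick check of these parameters, here is a minimal sketch with a made-up `DataFrame`. It relies on the fact that `to_csv()` returns the CSV text as a string when no path is given, so nothing is written to disk.

```python
import io
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({"major": ["Physics", "History"], "median": [70000, 50000]})

# With no path, to_csv() returns the CSV text instead of writing a file.
# Use a semicolon as the delimiter and leave out the index.
text = df.to_csv(sep=";", index=False)
print(text)

# Reading the text back with the same delimiter recovers the table.
roundtrip = pd.read_csv(io.StringIO(text), sep=";")
print(roundtrip)
```

Writing with `index = False` is usually what you want when the index is just the default row numbering; otherwise it comes back as an extra unnamed column when the file is read again.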

# Reading Excel files

Reading an Excel file can be done with the `read_excel()` function. It takes a filename or a URL and returns a `DataFrame`. It can take other arguments, such as a specific sheet name to read the data from.

In [None]:
# The keyword is sheet_name; the older spelling sheetname is no longer accepted
ratesraw = pd.read_excel(r"http://www.bankofengland.co.uk/statistics/Documents/dl/251115fsg.xls", sheet_name="Data")
ratesraw.head()

Other arguments let you skip a given number of rows, supply a custom set of column names, etc.

In [None]:
rates = pd.read_excel(r"251115fsg.xls", 
                      sheet_name = "Data", header = None, skiprows = 4, 
                      names = ["date","r"])
rates.head()

We can also use a specific column as the index via `index_col`. In general, the approach and syntax are similar to those for CSV files.

In [None]:
rates1 = pd.read_excel(r"251115fsg.xls", 
                      sheet_name = "Data", header = None, skiprows = 4, 
                      names = ["r"], index_col=0)
rates1.head()

# Writing Excel files

**Big time warning: If you have an Excel file with the same name, the method shown below will essentially delete it and recreate it, including only the data from your dataframe. It will NOT update only specific sheets or ranges in an existing spreadsheet. If you need more advanced functionality, such as writing data to a specific range in a specific sheet, look elsewhere (e.g. the `openpyxl` library).**

The dataframe method for writing to Excel is called `to_excel()`.

In [None]:
rates1["r"] += 1

In [None]:
rates1.to_excel("rates1.xlsx",sheet_name="rates")

Again, you can pass a number of parameters. For instance, as shown below, you can choose not to write the `DataFrame` index and column names. You can also choose the starting cell in the sheet.

In [None]:
rates1.to_excel("rates1.xlsx", sheet_name="rates", header = False, 
                index = False, startcol=3, startrow=3)

# Getting data directly from statistical sources

- Pandas has an associated library called `pandas_datareader` which facilitates access to information from several popular data sources.
- Examples include (see [here](https://pandas-datareader.readthedocs.io/en/latest/remote_data.html) for the full list):
    - Eurostat      
    - World Bank
    - OECD
    - Yahoo! Finance
    - Google Finance
    - St. Louis Fed (FRED)
- We shall look at the first three to get an idea of how it works.

# Getting a Eurostat dataset

In [None]:
import pandas_datareader.data as web

In [None]:
# Table tps00001 reports the population of a country 
# on 1 January of the respective year.
df = web.DataReader("tps00001", 'eurostat')

In [None]:
df.head(3)

In [None]:
df.index

In [None]:
df.columns

## An aside on working with `MultiIndex` objects

- The previous example shows that the columns are represented by a complex object, a `MultiIndex`. 
- This is essentially a nested structure of column names with headings, subheadings etc.
- Here are a few common operations on such objects:

In [None]:
# Selection
df["Population on 1 January - total"]["Albania"]

In [None]:
# One more level, produces a Series
df["Population on 1 January - total"]["Albania"]["Annual"]

In [None]:
# Selecting several at once
df["Population on 1 January - total"][["Albania","Azerbaijan"]]

In [None]:
# You can reassign the columns to simplify the structure
df1 = df["Population on 1 January - total"][["Albania","Azerbaijan"]]
df1.columns = ["Albania","Azerbaijan"]
df1
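
The same kinds of operations can be tried offline on a toy table. The sketch below is an illustration only: the level names (`indicator`, `country`) and all numbers are made up and do not come from Eurostat. It also shows tuple selection, `droplevel()` and `xs()` as alternatives to chaining several `[...]` lookups.

```python
import pandas as pd

# Small made-up table with two column levels (indicator, country).
cols = pd.MultiIndex.from_product(
    [["Population"], ["Albania", "Azerbaijan"]],
    names=["indicator", "country"],
)
toy = pd.DataFrame([[1.0, 2.0], [1.5, 2.5]], index=[2014, 2015], columns=cols)

# A tuple selects along several levels at once and returns a Series.
albania = toy[("Population", "Albania")]

# droplevel() removes a level that carries no extra information.
flat = toy.droplevel("indicator", axis=1)

# xs() takes a cross-section on a named level.
azer = toy.xs("Azerbaijan", axis=1, level="country")

print(albania)
print(flat)
print(azer)
```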

# Getting a World Bank dataset

In [None]:
from pandas_datareader import wb

In [None]:
# Get final consumption expenditure (% of GDP) 
df = wb.download(indicator='NE.CON.TETC.ZS', country=["BG","GB","RU"],
                 start=2010,end=2015)

In [None]:
df.T

The `wb` module also has functions to list the country codes (`wb.get_countries()`), search for keywords in indicator metadata (`wb.search()`) etc.

# Getting an OECD dataset

In [None]:
import pandas_datareader.data as web

In [None]:
# Growth in GDP per capita, productivity and ULC
df = web.DataReader('PDB_GR', 'oecd')

In [None]:
df["Austria"]["Total hours worked"]

# Interfacing with a database

**Note: Given for general info; not covered in detail.**

- Pandas has functions to retrieve information from databases and write back `DataFrame`s to databases.
- This relies on the powerful `SQLAlchemy` library.
- There are functions such as: 
   - `read_sql_table()` to retrieve a table from a database
   - `read_sql_query()` to run a query against the database
- A dataframe has a method `to_sql()` to write it as a table in the database.
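
For orientation, here is a minimal sketch of the round trip. It uses an in-memory SQLite database from the standard library's `sqlite3` module, which pandas accepts directly in place of an `SQLAlchemy` connection; the table name `indicators` and the data are made up. Note that `read_sql_table()` does need `SQLAlchemy`, so only `to_sql()` and `read_sql_query()` are shown.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Made-up data for illustration.
df = pd.DataFrame({"country": ["BG", "GB", "RU"], "value": [1.0, 2.0, 3.0]})

# Write the DataFrame as a table (without the index column).
df.to_sql("indicators", conn, index=False)

# Run a query and get the result back as a DataFrame.
result = pd.read_sql_query(
    "SELECT country, value FROM indicators WHERE value > 1.5 ORDER BY country",
    conn,
)
conn.close()

print(result)
```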

# Working with Stata, SAS and SPSS files

**Note: No code examples here. Given for general info.**

- These file formats are relatively popular and you may have to read in and process data packaged in one of them.
- The respective native Pandas functions are:
  - `read_stata()`
  - `read_sas()`
- For SPSS files, recent Pandas versions provide `read_spss()`, which relies on the external `pyreadstat` package. Alternatively, the `savReaderWriter` module provides IO functions to work with the `sav` format.
- As a general rule, if you need to export data from Python to another program, the safest choice is probably to export in plain-text format and then read it into the other application.