# Setup

Ensure that our database is ready.

In [None]:
%%bash
if [[ -d project-tycho-utilities ]];
then
  cd project-tycho-utilities/
  git pull
else
  git clone https://github.com/lgautier/project-tycho-utilities.git
  cd project-tycho-utilities/
fi
DBNAME=../tycho.db make all

Create a file with the content of our table `location`.
We will use it later.

In [None]:
import sqlite3
dbfilename = "tycho.db"
dbcon = sqlite3.connect(dbfilename)
cursor = dbcon.cursor()

sql = """
SELECT
  state, city
FROM
  location
"""

cursor.execute(sql)

import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('location.csv', 'w', newline='') as fh:
  csv_w = csv.writer(fh)
  csv_w.writerow(('state', 'city'))
  csv_w.writerows(cursor)

---

# Pandas DataFrame

## Building DataFrames

A pandas `DataFrame` is not unlike an SQL table, an R data frame,
or a (well-structured) spreadsheet.

As in R, a data frame is essentially an array of "columns".
Each column is a `pandas.Series` object, that is a one-dimensional array.

<!-- label:pandas_series -->

In [None]:
import pandas
import numpy

s = pandas.Series([1, 3, 5, numpy.nan, 6, 8])
s


The `Series` can be used as a "column" in a data frame. 

<!-- label:pandas_series_dataframe_1 -->

In [None]:

pandas.DataFrame(s)


The `Series` can be given a name, in which case it becomes the name
of the column.

<!-- label:pandas_series_dataframe_2 -->

In [None]:
s = pandas.Series([1, 3, 5, numpy.nan, 6, 8], name='measure_a')
pandas.DataFrame(s)


---

The constructor for `DataFrame` can be a little counter-intuitive when
building a multi-column `DataFrame`, as it might treat its arguments
as sequences of rows or of columns depending on the data structure.
The pandas documentation will be your ally.


Here the constructor considers each `Series` as a row:

In [None]:
pandas.DataFrame([s,s])

Here the constructor considers each `Series` as a column:

<!-- label:pandas_dataframe_dict -->

In [None]:
pandas.DataFrame({'a': s, 'b': s})
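
A quick way to check which interpretation the constructor picked is to look at the shape of the result. The sketch below rebuilds the same `Series` as above so that it runs on its own; the variable names `by_rows` and `by_cols` are only illustrative.

```python
import numpy
import pandas

s = pandas.Series([1, 3, 5, numpy.nan, 6, 8], name='measure_a')

# A list of Series: each Series becomes one row.
by_rows = pandas.DataFrame([s, s])
# A dict of Series: each Series becomes one column, named after its key.
by_cols = pandas.DataFrame({'a': s, 'b': s})

print(by_rows.shape)  # (2, 6)
print(by_cols.shape)  # (6, 2)
```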

---

`pandas.DataFrame` objects can also be built by reading data from CSV files.

**Note:** As with a regular R `data.frame` object, all data is loaded into memory.
This only works if the machine has enough memory to hold the whole table.

<!-- label:pandas_dataframe_read_csv -->

In [None]:
csv_filename = 'location.csv'
dataf = pandas.read_csv(csv_filename)
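
When a file is too large to fit in memory, one possible workaround is to read it in pieces with the `chunksize` parameter of `pandas.read_csv`. The sketch below is only an illustration: it uses a small in-memory string (hypothetical content) as a stand-in for a large file, and simply counts rows chunk by chunk.

```python
import io
import pandas

# Stand-in for a large CSV file on disk (hypothetical content).
csv_text = (
    "state,city\n"
    "MA,BOSTON\n"
    "MA,LOWELL\n"
    "NY,ALBANY\n"
    "NY,BUFFALO\n"
    "MI,DETROIT\n"
)

n_rows = 0
# With `chunksize`, read_csv returns an iterator of DataFrames
# holding at most `chunksize` rows each.
for chunk in pandas.read_csv(io.StringIO(csv_text), chunksize=2):
    n_rows += len(chunk)

print(n_rows)  # 5
```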

## Working with data frames

Types for the columns are inferred. This is often acceptable for interactive work, but
can also lead to surprises.

In [None]:
dataf.dtypes
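
One example of such a surprise, sketched here with made-up data rather than the `location` table: identifiers that look like numbers are parsed as integers, which silently drops leading zeros. Passing `dtype` to `pandas.read_csv` overrides the inference for the columns named.

```python
import io
import pandas

# Hypothetical data: ZIP codes are identifiers, not quantities.
csv_text = "zip,city\n02134,BOSTON\n10001,NEW YORK\n"

# Inferred types: the column is parsed as integers.
inferred = pandas.read_csv(io.StringIO(csv_text))
# Explicit type: the column is kept as text.
explicit = pandas.read_csv(io.StringIO(csv_text), dtype={'zip': str})

print(inferred['zip'].iloc[0])  # 2134
print(explicit['zip'].iloc[0])  # 02134
```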

Visual inspection of a few rows in the table is a common first step when working interactively.
This is often why one wants to "see the data in a spreadsheet".

In [None]:
dataf.head()

In [None]:
dataf.tail()

The size of the `DataFrame` (number of rows and columns)
is also a common early check:

In [None]:
dataf.shape

Column names.

In [None]:
dataf.columns

Summary statistics.

<!-- label:pandas_dataframe_describe -->

In [None]:
dataf.describe()
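
What `describe()` reports depends on the column types. For columns holding text, the summary consists of counts and most frequent values rather than means and quantiles. A minimal sketch with a toy table (hypothetical rows):

```python
import pandas

toy = pandas.DataFrame({'state': ['MA', 'MA', 'NY'],
                        'city': ['BOSTON', 'LOWELL', 'ALBANY']})

# For text columns the rows of the summary are:
# count, unique, top (most frequent value) and freq (its frequency).
summary = toy.describe()
print(summary)
```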

Filtering rows is a common operation when working with data. This is the `WHERE` clause
in SQL.

<!-- label:pandas_dataframe_filter -->

In [None]:
# Vectorized string test on the column; rows with a missing state are dropped
res = dataf[dataf['state'].str.startswith('M', na=False)]

print('Original shape: %r' % (dataf.shape,))
print('After filter: %r' % (res.shape,))
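
Filters are boolean masks, so several conditions can be combined the way `AND` and `IN` would be in a `WHERE` clause. The following is a sketch on a toy table (hypothetical rows), not on the `location` data:

```python
import pandas

toy = pandas.DataFrame({'state': ['MA', 'MA', 'MI', 'NY'],
                        'city': ['BOSTON', 'LOWELL', 'DETROIT', 'ALBANY']})

# WHERE state LIKE 'M%' AND city <> 'LOWELL'
starts_with_m = toy['state'].str.startswith('M')
not_lowell = toy['city'] != 'LOWELL'
both = toy[starts_with_m & not_lowell]

# WHERE state IN ('NY', 'MI')
in_list = toy[toy['state'].isin(['NY', 'MI'])]

print(both)
print(in_list)
```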

---

Sorting:

In [None]:
res = (dataf
       .sort_values('city', ascending=False))       

---

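Like with SQL, tables can be joined on a key. With `how='inner'`, `merge()` behaves like SQL's `INNER JOIN`.

<!-- label:pandas_dataframe_inner_join -->

In [None]:
# DataFrame with the number of cities per state in a column "count_cities"
res = (dataf[['state', 'city']]
       .groupby('state')
       .count()
       .reset_index()
       .rename(columns={'city': 'count_cities'}))

# Join by state (since the counts are aggregates by state).
# merge() matches on the column "state" in both tables; join() would
# match against the index of `res`, which no longer holds the states
# after reset_index().
dataf_with_count = dataf.merge(res, on='state', how='inner')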
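
As a sketch of how the type of join changes the result, the example below uses `DataFrame.merge` and its `how` parameter on two toy tables (hypothetical rows). An inner join keeps only the keys present in both tables, while a left join keeps every row of the left table and fills the gaps with `NaN`.

```python
import pandas

cities = pandas.DataFrame({'state': ['MA', 'MA', 'NY', 'TX'],
                           'city': ['BOSTON', 'LOWELL', 'ALBANY', 'AUSTIN']})
counts = pandas.DataFrame({'state': ['MA', 'NY'],
                           'count_cities': [2, 1]})

# INNER JOIN: 'TX' has no match in `counts`, so its row is dropped.
inner = cities.merge(counts, on='state', how='inner')
# LEFT JOIN: the 'TX' row is kept, with a missing count.
left = cities.merge(counts, on='state', how='left')

print(inner)
print(left)
```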

---

Pivot. Pivoting reshapes a table so that the distinct values of one column become column headers.
This is something usually hard(er) to achieve with SQL.

In [None]:

res = (res
       .pivot(index='state', columns='count_cities'))
res
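
To make the reshaping easier to see, here is a minimal sketch on a toy table in "long" form. The columns `year` and `cases` and their values are made up for the illustration. Each distinct value of `year` becomes a column, and `cases` fills the cells.

```python
import pandas

# Hypothetical data in "long" form: one row per (state, year) pair.
long_form = pandas.DataFrame({
    'state': ['MA', 'MA', 'NY', 'NY'],
    'year': [1950, 1951, 1950, 1951],
    'cases': [10, 12, 7, 9],
})

# "Wide" form: one row per state, one column per year.
wide = long_form.pivot(index='state', columns='year', values='cases')
print(wide)
```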

---

# Read from SQL


<!-- label:pandas_dataframe_database -->

In [None]:
import sqlite3

# Read sqlite query results into a pandas DataFrame
dbcon = sqlite3.connect("tycho.db")

sql = """
SELECT state, city
FROM location
WHERE state LIKE 'M%'
"""

dataf = pandas.read_sql_query(sql, dbcon)

print(dataf.head())

---

In [None]:

res = (dataf
       .groupby('state')
       .count()
       .sort_values('city', ascending=False))
res

In [None]:
sql = """
SELECT state, count(city) AS ct
FROM location
WHERE state LIKE 'M%'
GROUP BY state
ORDER BY ct DESC
"""
res = pandas.read_sql_query(sql, dbcon)
res
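
The two cells above compute the same counts, once in pandas and once in the database. The sketch below checks that equivalence on a small in-memory SQLite database, so that it runs without `tycho.db`; the rows inserted are hypothetical. Note that `LIKE` in SQLite ignores case for ASCII letters by default, while `str.startswith` does not; the toy data is all upper case, so both agree here.

```python
import sqlite3
import pandas

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE location (state TEXT, city TEXT)')
con.executemany('INSERT INTO location VALUES (?, ?)',
                [('MA', 'BOSTON'), ('MA', 'LOWELL'),
                 ('MI', 'DETROIT'), ('NY', 'ALBANY')])

# Aggregation done by the database.
sql = """
SELECT state, count(city) AS ct
FROM location
WHERE state LIKE 'M%'
GROUP BY state
ORDER BY ct DESC
"""
from_sql = pandas.read_sql_query(sql, con)

# Same aggregation done by pandas on the unaggregated rows.
rows = pandas.read_sql_query('SELECT state, city FROM location', con)
from_pandas = (rows[rows['state'].str.startswith('M')]
               .groupby('state')['city']
               .count()
               .sort_values(ascending=False)
               .reset_index(name='ct'))

print(from_sql)
print(from_pandas)
con.close()
```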