# An introduction to `pandas` dataframes in Python

## Overview
You can use Python as part of your workflow to look at, transform, and analyze data!

[`pandas`](https://pandas.pydata.org/docs/) is a library for the Python programming language. It provides you with high-performance data structures and data analysis tools. 

We're going to use dataframes to work with tabular data, meaning data structured into columns and rows. Think:

- Spreadsheets in Excel; 
- Comma-separated value (CSV) and tab-separated value (TSV) plain-text files;
- Tables in a relational database.
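
To make "columns and rows" concrete, here is a minimal sketch that builds a tiny dataframe by hand. The item names and quantities are made up for illustration and are not part of any dataset used below.

```python
import pandas as pd

# Made-up data: each dictionary key becomes a column name,
# and each list holds that column's values, one per row.
example_df = pd.DataFrame({
    "item": ["apple", "pear", "plum"],
    "quantity": [3, 5, 2],
})

# `shape` is a tuple: (number of rows, number of columns).
print(example_df.shape)
print(example_df)
```
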

## Datasets
We can find data in the Hack HPC [data list](http://hackhpc.org/data/). I found the Governor's Office of Planning and Budget "Housing Units by County 2000-2011" [dataset](https://opb.georgia.gov/social-and-economic-data).

## Reading data from a flat file into a dataframe

Take a tabular file type (CSV, TSV, XLS, XLSX) and put it into a dataframe object to work with.

In [None]:
# Import the `pandas` module so that we can use its functions and objects.
import pandas as pd

In [None]:
# Read data from an Excel file into a new `pandas` dataframe object.
housing_df = pd.read_excel("Housing_Units-2000-2011.xls")

# Look at what's in the dataframe object.
housing_df

## Choosing a dataframe header

Something strange seems to be happening. We see what appear to be the years in the first row of data. And the header has column names that include "Unnamed: #".

In [None]:
# Read data from an Excel file into a new `pandas` dataframe object.
# This time, specify the header from the original file.
housing_df = pd.read_excel("Housing_Units-2000-2011.xls", header=1)

# Look at what's in the dataframe object.
housing_df
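
The `header` argument can also be tried without the Excel file. The sketch below is an illustration only: it uses `pd.read_csv()` on an in-memory string of made-up data, where the first line is a title and the second line holds the real column names.

```python
import io
import pandas as pd

# Made-up file contents: line 0 is a title, line 1 holds the column names.
raw_text = """Example title row,,
county,2000,2011
Alpha,100,120
Beta,200,190
"""

# `header=1` tells pandas to use the line at position 1 as the header.
# Lines before it are skipped.
demo_df = pd.read_csv(io.StringIO(raw_text), header=1)

print(demo_df.columns.tolist())
print(demo_df)
```

Note that column names read from a CSV come back as strings (`'2000'`), while the Excel file above gives us the years as numbers (`2000`).
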

## Changing column names
Now the first column is "Unnamed: 0", but we saw that it originally said "Housing Units Estimates: Georgia Counties, 2000-2011", and it seems to include the name of the county.

In [None]:
# Rename a column and assign the resulting dataframe back to the same variable.
housing_df = housing_df.rename(columns={"Unnamed: 0": "county_name",})

# Look at what's in the dataframe object.
housing_df

## Slicing rows using positional indexes
The data still look a little messy. We have:

- The first row with data for all of Georgia instead of just one county;
- A row filled with "NaN";
- The last row with information about where the data came from (aka metadata referring to the data source).

We can grab only the rows we care about. This is called "slicing". In this case we will use the "positional index" of the rows to specify what we want to keep.

On the left-hand side of the dataframe, we see that `pandas` automatically added a numeric index in bold (the column without a name). We can use this number to get at a particular row or range of rows.

In [None]:
# Example of slicing a dataframe by specifying rows.
housing_df[0:3]

Note:

- Positional indexes start at '0'.
- Slicing by index is exclusive of the last index specified.
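
A small sketch of both notes, using a made-up dataframe rather than the housing data. It also shows `.iloc`, which is an explicit way to say "select by position".

```python
import pandas as pd

# Made-up data with the default numeric index 0 through 4.
letters_df = pd.DataFrame({"letter": ["a", "b", "c", "d", "e"]})

# Rows at positions 1 and 2. Position 3 is excluded.
middle_df = letters_df[1:3]
print(middle_df)

# `.iloc` makes the positional intent explicit and selects the same rows.
same_df = letters_df.iloc[1:3]
print(middle_df.equals(same_df))
```
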

In [None]:
# Slice a dataframe and assign the resulting dataframe back to the same variable.
housing_df = housing_df[1:160]

# Look at what's in the dataframe object.
housing_df

## Setting your own index labels
What if we decide that we want each county name to be our index instead of these numbers? We might want this so that we can refer to the county names as our row labels.

First, we need to make sure that the counties in this dataset are unique. That is, that each one will uniquely identify a single row, and multiple rows don't accidentally (or intentionally) repeat county names.

In [None]:
# Make sure that the values in a column are unique, and there are no duplicates.
housing_df['county_name'].is_unique

### More about unique values

In [None]:
# Example: What would it look like to have duplicates?
counties_dup_df = pd.DataFrame(['Appling', 'Atkinson', 'Bacon', 'Appling', 'Baker'], columns=['county_name'])

# Look at what's in the dataframe object.
counties_dup_df

In [None]:
# Check if values in a column are unique.
counties_dup_df['county_name'].is_unique

In [None]:
# See the number of times each value occurs.
counties_dup_df['county_name'].value_counts()

### Setting the index

In [None]:
# Use a column as the dataframe index and assign the resulting dataframe back to the same variable.
housing_df = housing_df.set_index('county_name')

# Look at the first few rows of the dataframe object, instead of the whole thing.
housing_df.head(5)

In [None]:
# Look at the last few rows of the dataframe object, instead of the whole thing.
housing_df.tail(5)

## Selecting values using row and column labels

Now, the new index we created gives us row labels that are text (county names) instead of numbers. We can use the new labels to select and slice rows, instead of the positional index. 

We can also specify both rows *and* columns that we want to select using `DataFrame.loc[row_label, column_label]`. 

For rows these labels will be the index we specified, and for columns they will be the column name.

In [None]:
# Choose a subset of data using labels.
housing_df.loc['Treutlen':'Twiggs',[2000, 2011]]
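
One difference from positional slicing is worth seeing on its own: a label slice with `.loc` includes both endpoints. The sketch below uses a made-up dataframe with hypothetical county names, not the housing data.

```python
import pandas as pd

# Made-up data indexed by name, with integer column labels like the housing data.
toy_df = pd.DataFrame(
    {2000: [10, 20, 30, 40], 2011: [11, 19, 33, 44]},
    index=["Alpha", "Beta", "Delta", "Gamma"],
)

# Label slice: 'Beta' through 'Delta', with both endpoints included.
subset_df = toy_df.loc["Beta":"Delta", [2000, 2011]]
print(subset_df)
```
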

## Math functions on columns

Let's answer the question: How much has the number of housing units changed for each county from 2000 to 2011?

In [None]:
# Subtract the values in one column from the values in another.
# Assign the result (a series) to a new column in the same dataframe.
housing_df['housing_change'] = housing_df[2011] - housing_df[2000]

# Look at the first few rows of the dataframe object.
housing_df.head(10)

## Subsetting data using criteria

It looks like some counties actually had fewer housing units in 2011 than 2000. Let's see how we can find all of those, and answer the question: Which counties had fewer housing units in 2011 than in 2000? 

To do this, we can subset data based on criteria. 

- Equals: ==
- Not equals: !=
- Greater than, less than: > or <
- Greater than or equal to: >=
- Less than or equal to: <=

In [None]:
# Subset data based on criteria using a "boolean mask". 
# Assign the resulting dataframe to a new variable.
fewer_housing_units_df = housing_df[housing_df['housing_change'] < 0]

# Look at what's in the dataframe object.
fewer_housing_units_df

### More about using Boolean masks

In [None]:
# Get a boolean (True/False) series based on the dataframe.
housing_df['housing_change'] < 0

In [None]:
# Assign the series to a variable.
mask = housing_df['housing_change'] < 0

In [None]:
# Use the variable to choose rows from the dataframe that match with "True".
housing_df[mask]

In [None]:
# Do it all in one line.
housing_df[housing_df['housing_change'] < 0]
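
The same steps can be tried on a tiny, made-up dataframe, where it is easy to check the mask by eye. The county names and numbers here are hypothetical.

```python
import pandas as pd

# Made-up change values for four hypothetical counties.
change_df = pd.DataFrame(
    {"housing_change": [120, -15, 0, -3]},
    index=["Alpha", "Beta", "Delta", "Gamma"],
)

# The comparison returns one True/False value per row.
decreased = change_df["housing_change"] < 0
print(decreased)

# True counts as 1, so summing the mask counts the matching rows.
print(decreased.sum())

# Indexing with the mask keeps only the rows marked True.
print(change_df[decreased])
```
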

## Subsetting data using column names

In [None]:
# Choose a subset of data using column names and assign the resulting dataframe to the same variable.
fewer_housing_units_df = fewer_housing_units_df[[2000, 2011,'housing_change']]

# Look at what's in the dataframe object.
fewer_housing_units_df

## Writing data to a file

In [None]:
# Write dataframe to a CSV file.
fewer_housing_units_df.to_csv("counties_with_fewer_housing_units.csv")
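
By default, `to_csv()` writes the index as the first column of the file. The sketch below shows one way to get that index back when reading the data again. It uses a made-up dataframe and keeps the CSV text in memory, so it does not create any files.

```python
import io
import pandas as pd

# Made-up data with a named index, similar in shape to the dataframe above.
toy_df = pd.DataFrame(
    {"housing_change": [-15, -3]},
    index=pd.Index(["Beta", "Gamma"], name="county_name"),
)

# With no path given, `to_csv()` returns the CSV text instead of writing a file.
csv_text = toy_df.to_csv()
print(csv_text)

# When reading the text back, `index_col` turns the named column into the index again.
restored_df = pd.read_csv(io.StringIO(csv_text), index_col="county_name")
print(restored_df)
```
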

## Merging data from two dataframes

You can merge (aka "join") two dataframes using the `pd.merge()` function in order to combine data into a single dataframe. Before doing this, we need to get a new dataset and clean it up.

I happened to find the Governor's Office of Student Achievement, Georgia School Grade Reports [datasets](https://schoolgrades.georgia.gov/dataset), and downloaded School-Level Data for the 2019 year.

In [None]:
# Read data from a CSV file into a new `pandas` dataframe object.
schools_df = pd.read_csv("school-19.csv")

# Look at what's in the dataframe object.
schools_df

In [None]:
# Look at all the column names.
schools_df.columns

In [None]:
# Choose a subset of data using column names and assign the resulting dataframe to the same variable.
schools_df = schools_df[['SystemName', 
                         'SchoolName', 
                         'Zip_Code',
                         'total_enroll', 
                         'Grades', 
                         'Grade']]

# Look at the first few rows of the dataframe object.
schools_df.head(15)

### Tangent: A rant about naming things

This is crazy! To use this data, you not only need to know the column names but also keep track of how each one is written, because you can't assume that a standard naming convention is being used.

**?!?** The columns we pulled out use three different naming conventions **?!?**

- Capitalized words with no space indicator ("SystemName")
- Capitalized words with underscores ("\_") for spaces ("Zip_Code")
- All lower case with underscores ("\_") for spaces ("total_enroll")


It's worth learning about [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) to make your life easier. Choose one convention, name/rename columns, and stick to it.
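
As a sketch of what sticking to one convention could look like, the example below renames columns to all lower case with underscores. It works on a separate, made-up dataframe so that the rest of this notebook is unaffected, and the new names are just one possible choice.

```python
import pandas as pd

# Made-up one-row dataframe that reuses three of the column names from above.
mixed_df = pd.DataFrame(
    {"SystemName": ["Alpha County"], "Zip_Code": ["00000"], "total_enroll": [100]}
)

# Map each old name to a lower-case, underscore-separated name.
# These new names are an illustration, not an official convention.
tidy_names = {
    "SystemName": "system_name",
    "Zip_Code": "zip_code",
    "total_enroll": "total_enroll",
}
renamed_df = mixed_df.rename(columns=tidy_names)
print(renamed_df.columns.tolist())
```
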

### Changing values in a column (a quick look)

We can also find and replace values in a column. In this data, we want to remove the word "County" after the county names in the `SystemName` column.

Learn more about the power of [`DataFrame.replace()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) and regular expressions (regex)!

In [None]:
# Replace part of a string with nothing.
schools_df = schools_df.replace(to_replace=" County", value="", regex=True)

# Look at the first few rows of the dataframe object.
schools_df.head(15)

### Merging dataframes

In [None]:
# Merge dataframes on the values in a particular column that you expect them to have in common.
merged_df = pd.merge(left=schools_df, right=fewer_housing_units_df, left_on='SystemName', 
                        right_on='county_name')

# Look at what's in the dataframe.
merged_df
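
By default `pd.merge()` performs an "inner" join, which keeps only the rows whose key appears in both dataframes. The sketch below uses two small, made-up dataframes to show that default next to `how="left"`, which keeps every row from the left dataframe.

```python
import pandas as pd

# Made-up tables that share county names under different column labels.
left_df = pd.DataFrame({
    "SystemName": ["Alpha", "Alpha", "Beta", "Gamma"],
    "SchoolName": ["School 1", "School 2", "School 3", "School 4"],
})
right_df = pd.DataFrame({
    "county_name": ["Alpha", "Beta", "Delta"],
    "housing_change": [-15, -3, -8],
})

# Default `how="inner"`: keep only rows whose key appears in both tables.
inner_df = pd.merge(left=left_df, right=right_df,
                    left_on="SystemName", right_on="county_name")
print(inner_df)

# `how="left"`: keep every row from the left table and fill the gaps with NaN.
left_join_df = pd.merge(left=left_df, right=right_df, how="left",
                        left_on="SystemName", right_on="county_name")
print(left_join_df)
```

In the inner join, "Gamma" (left only) and "Delta" (right only) both drop out. In the left join, "Gamma" stays and its `housing_change` is missing.
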

# Getting data into and out of a database with Python

## Intro
In some cases the data you want to access may be in a database, or you may want to put data into a database.

Since I'm talking about tabular data, I'm going to focus on relational databases. These databases make sense when the amount of data you're working with gets large, and a system for managing the relationships between data in different tables matters.

Here we'll assume we have access to an existing database. For reference, a "simple" database engine you can start with is SQLite. Learn more about SQLite [here](https://www.sqlite.org/index.html).

## Database adapters
If you're already using Python to manipulate data, you can use an adapter to connect to a database within your code:

- `sqlite3` for [SQLite databases](https://docs.python.org/3/library/sqlite3.html)
- `psycopg2` for [PostgreSQL databases](https://www.psycopg.org/docs/usage.html)
- `mysql-connector-python` for [MySQL databases](https://www.w3schools.com/python/python_mysql_getstarted.asp)

## Overview
The `sqlite3` module provides an interface for interacting with SQLite databases. It'll be my example of how using database adapters works in Python.

1. First, create a Connection object to represent the database using `sqlite3.connect()`.
2. Once you have a Connection, create a Cursor object with the `.cursor()` method.
3. The Cursor can perform all kinds of SQL (Structured Query Language) commands with the `.execute()` method.
4. Use `.commit()` to optionally save changes to the database.
5. When done, close the connection with `.close()`.

**Flow**: open connection -> create cursor -> execute SQL commands -> *commit changes ->* close connection
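
The whole flow can be sketched in a few lines. This example uses the special name `":memory:"`, which asks SQLite for a temporary in-memory database, so it does not create a file. The `note` table and the `conn_demo` and `cur_demo` names are made up for this illustration.

```python
import sqlite3

# Open a connection. ":memory:" is a temporary database that
# disappears when the connection is closed.
conn_demo = sqlite3.connect(":memory:")

# Create a cursor.
cur_demo = conn_demo.cursor()

# Execute SQL commands.
cur_demo.execute("CREATE TABLE note (id INTEGER PRIMARY KEY, body TEXT);")
cur_demo.execute("INSERT INTO note (id, body) VALUES (?, ?);", (1, "hello"))

# Commit changes.
conn_demo.commit()

# Read the data back to confirm it is there.
cur_demo.execute("SELECT id, body FROM note;")
rows = cur_demo.fetchall()
print(rows)

# Close the connection.
conn_demo.close()
```
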


In [None]:
# Import the `sqlite3` module.
import sqlite3

In [None]:
# Connect to a database.
# Note: SQLite databases are files. Database engines with servers will require more parameters in order to connect.
conn = sqlite3.connect('example.db')

In [None]:
# Create a cursor to execute commands.
c = conn.cursor()

### Using the cursor to work with the database connection

In [None]:
# Execute an SQL query to create a new database table.
sql_create_table = '''
    CREATE TABLE tree
    (id INT PRIMARY KEY NOT NULL,
    name TEXT NOT NULL,
    description TEXT,
    rating REAL);
    '''  

c.execute(sql_create_table)

In [None]:
# Execute an SQL query to add data (one record) into a table.

# The `sqlite3` module uses "?" as a placeholder wherever you want to use a value. 
# You then provide a tuple of values as the second argument to the cursor’s `execute()` method.
# Note: Other database modules may use a different placeholder, for example `psycopg2` uses "%s".
tree_record = (1, 'Sassafras', "mitten-shaped and trilobed leaves", 8)
sql_insert = "INSERT INTO tree (id, name, description, rating) VALUES (?,?,?,?);"

c.execute(sql_insert, tree_record)

In [None]:
# Execute an SQL query to add data (multiple records) into a table.
tree_records = [(2, 'American Hornbeam', 'muscular trunk', 7.75),
                (3, 'Flowering Dogwood', 'stinky flowers', 6.50),
                (4, 'Bald Cypress', 'knobby knees', 10),
                (5, 'Lacebark Elm', 'flaky bark', 6.25),  
            ]
c.executemany('INSERT INTO tree VALUES (?,?,?,?)', tree_records)

In [None]:
# Commit changes to the database to make them persistent across sessions.
conn.commit()

In [None]:
# Execute an SQL query to get data from an existing database table.
c.execute("SELECT * FROM tree;")

# Fetch the results.
c.fetchall()

`cursor.fetchall()` fetches all the rows of a query result. It returns all the rows as a list of tuples. An empty list is returned if there is no record to fetch.

`cursor.fetchmany(size)` returns the number of rows specified by the `size` argument. When called repeatedly, this method fetches the next set of rows of a query result and returns a list of tuples. If no more rows are available, it returns an empty list.

`cursor.fetchone()` returns a single record, or `None` if no more rows are available.
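
The three fetch methods all read from the same result set, so each call picks up where the previous one stopped. Here is a minimal sketch using a temporary in-memory database. The `num` table and the variable names are made up for this illustration.

```python
import sqlite3

# Temporary in-memory database with a made-up table of four numbers.
conn_demo = sqlite3.connect(":memory:")
cur_demo = conn_demo.cursor()
cur_demo.execute("CREATE TABLE num (n INTEGER);")
cur_demo.executemany("INSERT INTO num VALUES (?);", [(1,), (2,), (3,), (4,)])

cur_demo.execute("SELECT n FROM num ORDER BY n;")
first = cur_demo.fetchone()        # one row, as a tuple
next_two = cur_demo.fetchmany(2)   # the next two rows, as a list of tuples
rest = cur_demo.fetchall()         # whatever is left, as a list of tuples
after_end = cur_demo.fetchone()    # nothing left to fetch

print(first, next_two, rest, after_end)
conn_demo.close()
```
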

In [None]:
# Or, treat the cursor as an "iterator".
for row in c.execute("SELECT * FROM tree;"):
    print(row)

### Using `pandas` to work with a database connection

In [None]:
# Use `pandas` to read data from a database table directly into a dataframe.
tree_df = pd.read_sql_query(
    '''
    SELECT * FROM tree;
    ''',
    conn)

# Look at what's in the dataframe.
tree_df

In [None]:
# Write data from a dataframe directly into a database table.
schools_df.to_sql("schools", conn)
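
Two optional arguments to `to_sql()` are handy here. By default the dataframe index is written as an extra column, and the call raises an error if the table already exists (for example, when you rerun the cell). The sketch below shows `index=False` and `if_exists="replace"` on a made-up dataframe and a temporary in-memory database.

```python
import sqlite3
import pandas as pd

# Temporary in-memory database and a made-up dataframe.
conn_demo = sqlite3.connect(":memory:")
toy_df = pd.DataFrame({"name": ["Alpha", "Beta"], "score": [1.5, 2.5]})

# `index=False` leaves the dataframe index out of the table.
# `if_exists="replace"` overwrites an existing table instead of raising an error.
toy_df.to_sql("toy", conn_demo, index=False, if_exists="replace")

# Running the same call a second time is safe because of `if_exists="replace"`.
toy_df.to_sql("toy", conn_demo, index=False, if_exists="replace")

# Read the table back into a new dataframe.
round_trip_df = pd.read_sql_query("SELECT * FROM toy;", conn_demo)
print(round_trip_df)

conn_demo.close()
```
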

### Back to the cursor

In [None]:
# Use cursor as an "iterator" to see rows in the new database table.
for row in c.execute(
    '''
    SELECT 
        SchoolName,
        Grades,
        Grade
    FROM schools
    WHERE Grades = '9-12' AND Grade = 'A';
    '''
):
    print(row)

### Closing the database connection

In [None]:
# Close the cursor and database connection.
if conn:
    c.close()
    conn.close()
    print("The database connection is closed.")