# Using Python for Data Analysis

## Fundamental usage of Python

We'll need to understand some very basic Python in order to do this tutorial.

In [None]:
'Hello World'  # Change the letters, then press Ctrl + Enter on your keyboard

In [None]:
'你好世界'  # Or press the ▶️Run button above

#### Assigning variables to values

You can store values in variables for re-use. You've hopefully seen this in other programming languages:

In [None]:
# You can assign values to variables
a = 1
print('Print a single number:')
print(a)
print('Arithmetic operations, too:')
print(a + a)

#### Quick review of types

A few types we'll focus on:
- Booleans, which indicate whether something is true or false (`True` or `False`)
- Strings, which represent a sequence of characters, like `'hey'`
- Whole numbers like `7` or `101` (called *integers*) and numbers with a decimal point like `7.3` and `3.41` (called *floats*)
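
To make the list above concrete, here is a minimal sketch that asks Python for the type of each kind of value using the built-in `type()`. The variable names are made up for illustration.

```python
# Inspect the type of a few literal values with the built-in type()
is_raining = True   # Boolean
greeting = 'hey'    # string
count = 7           # integer
price = 7.3         # float

for value in (is_raining, greeting, count, price):
    print(repr(value), type(value).__name__)

# Mixing an integer and a float in arithmetic produces a float
total = count + price
print(type(total).__name__)
```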

In [None]:
b = 'Hello everyone'
c = '大家好'
print(b)
print(c)

In [None]:
'大' in '大家好'

In [None]:
b.lower()

In [None]:
b[0]  # Get the first character of the string; note that index 0 is the first position

In [None]:
b[6:]  # Get everything from the seventh character to the end
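
The indexing and slicing cells above can be summarized in one small sketch. It reuses the same string `b`; the other variable names are only illustrative.

```python
b = 'Hello everyone'

first = b[0]   # index 0 is the first character
last = b[-1]   # negative indexes count from the end
head = b[:5]   # characters at positions 0 through 4
tail = b[6:]   # position 6 (the seventh character) to the end

print(first, last, head, tail)
```

A slice `b[start:stop]` includes `start` and excludes `stop`, which is why `b[:5]` returns five characters.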

## Pandas Overview

Pandas is a widely used library for data analysis. We'll be using a small subset of its features for this talk.

First, execute the next cell to load up the `pandas` and `seaborn` libraries.

In [None]:
# If you're running this "locally" (on your own computer, outside of this lecture and not on Binder),
# you will need to install these libraries first, for example with `pip install pandas seaborn`
import pandas as pd
import seaborn as sns

In [None]:
# Loading in a CSV
# Read a file from a URL
AIR_TRAFFIC_URL = 'https://raw.githubusercontent.com/ajduberstein/sf_public_data/master/Air_Traffic_Passenger_Statistics.csv'
FLOOD_DATA_URL = 'https://raw.githubusercontent.com/ajduberstein/dartmouth_flood_data/master/floods.csv'
floods = pd.read_csv(FLOOD_DATA_URL)
sfo = pd.read_csv(AIR_TRAFFIC_URL)
# Read a file from your local file system
floods_over_time = pd.read_csv('./floods_by_cause_by_year.csv')
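
If you want to try `pd.read_csv` without a network connection or a local file, one option is to feed it CSV text held in memory. The sketch below assumes a tiny made-up table; its rows are hypothetical and are not taken from the flood data.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a downloaded file
csv_text = """id,main_cause,area
1,Heavy rain,1200.5
2,Monsoon,560.0
3,Heavy rain,89.2
"""

# read_csv accepts a file path, a URL, or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # column types inferred by pandas
```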

In [None]:
# See the first 5 rows
floods.head()

In [None]:
# See the last 5 rows in any data set
floods.tail()

In [None]:
# Get summary statistics
floods.describe()

In [None]:
# See the names of all the columns
floods.columns

In [None]:
# Select a single column
floods['area']

In [None]:
# Get a histogram for a single column
floods['area'].hist()

In [None]:
# You're also in Python, so you can call other Python functions
import math
# Apply a log scale to the histogram

# In case logs are murky:
# for whole numbers, log10 is roughly the number of digits minus 1
# math.log10(100) == 2
# math.log10(1000) == 3
# etc.

floods['area'].apply(math.log10).hist()
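
The "number of digits minus 1" rule of thumb from the cell above can be checked with a short sketch. The sample numbers are arbitrary, and the rule is only exact after rounding down with `math.floor`.

```python
import math

# For whole numbers >= 1, floor(log10(n)) equals (number of digits - 1)
for n in [7, 42, 100, 999, 1000, 123456]:
    digits = len(str(n))
    magnitude = math.floor(math.log10(n))
    print(n, digits, magnitude)

# log10 is undefined at zero, so Python raises an error there
try:
    math.log10(0)
except ValueError as err:
    print('log10(0) fails:', err)
```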

In [None]:
# Aggregation
floods.count()

In [None]:
# Your turn! How many rows are in the SFO data?

In [None]:
# Aggregate by group
floods.groupby('main_cause').count().head()

In [None]:
# Sorting and chaining functions: this gives the 5 most frequent flood causes (head() returns 5 rows by default)
grouped_floods = floods.groupby('main_cause').count()
grouped_floods.sort_values('id', ascending=False).head()
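
To see what the group, count, and sort steps produce, here is a minimal sketch on a made-up six-row table. The column names mirror the ones used above, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical miniature version of the floods table
mini = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'main_cause': ['Heavy rain', 'Monsoon', 'Heavy rain',
                   'Snowmelt', 'Heavy rain', 'Monsoon'],
})

# count() tallies the non-missing values in each remaining column, per group
counts = mini.groupby('main_cause').count()

# Sort by the 'id' count, largest first
ranked = counts.sort_values('id', ascending=False)
print(ranked)
```

Because `count()` skips missing values, sorting on a column that is never empty (such as an ID) gives the number of rows per group.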

In [None]:
# Your turn: Plot the data in ascending order
# grouped_floods.sort_values('id', ascending=True).head()

In [None]:
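# Log-transform the displaced counts so they are easier to plot
import math

def safe_log10(num):
    # log10 is negative below 1 and undefined at 0, so leave those values unchanged
    if num >= 1:
        return math.log10(num)
    return num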

floods['log10_displaced'] = floods['displaced'].apply(safe_log10)
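
It can help to see how `safe_log10` treats a few inputs before relying on it. The sketch below repeats the helper from the cell above and runs it on some arbitrary sample values.

```python
import math

def safe_log10(num):
    # Same helper as in the cell above
    if num >= 1:
        return math.log10(num)
    return num

for value in [0, 0.5, 1, 10, 1000]:
    print(value, safe_log10(value))
```

Note that values below 1 come back unchanged rather than log-transformed, so a raw `0.5` and a transformed `0.5` would be indistinguishable in the new column. That trade-off is acceptable for counts of people, which are whole numbers.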

## Charts

### relplots

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(rc={'figure.figsize':(24.7, 8.6)})
plt.figure(figsize=(45,10))

In [None]:
# Correlations in Seaborn
sns.relplot(x='log10_displaced', y='area', data=floods, aspect=3)

In [None]:
sns.relplot(x='lng', y='lat', alpha=0.1, data=floods, aspect=2)

In [None]:
# Create your own function to recode the data

def recode_cause(cause):
    cause = str(cause).lower()
    if 'monsoon' in cause:
        return 'MONSOON'
    elif 'rain' in cause:
        return 'RAIN'
    elif 'melt' in cause:
        return 'SNOWMELT'
    elif 'tropical storm' in cause:
        return 'TROPICAL STORM'
    else:
        return 'OTHER'


floods['cause_recoded'] = floods['main_cause'].apply(recode_cause)
floods.head()
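
Because `recode_cause` returns on the first keyword that matches, the order of the checks matters. The sketch below repeats the function and runs it on a few hypothetical cause strings (they are not taken from the data set) to show the effect.

```python
def recode_cause(cause):
    # Same logic as the cell above: the first matching keyword wins
    cause = str(cause).lower()
    if 'monsoon' in cause:
        return 'MONSOON'
    elif 'rain' in cause:
        return 'RAIN'
    elif 'melt' in cause:
        return 'SNOWMELT'
    elif 'tropical storm' in cause:
        return 'TROPICAL STORM'
    else:
        return 'OTHER'

# Hypothetical cause strings, chosen to exercise each branch
samples = ['Monsoonal rain', 'Heavy Rain', 'Snowmelt and rain',
           'Tropical Storm Alma', 'Dam break', None]
recoded = {sample: recode_cause(sample) for sample in samples}
for original, label in recoded.items():
    print(repr(original), '->', label)
```

A string that mentions both snowmelt and rain is labeled `RAIN`, since the rain check runs first. Missing values are converted to text by `str()` and fall through to `OTHER`.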

In [None]:
# Show relationship between two variables
sns.relplot(x='lng', y='lat', alpha=0.2, data=floods, hue='cause_recoded', aspect=3)

In [None]:
import pydeck

# Same data on an interactive map
color_lookup = pydeck.data_utils.assign_random_colors(floods['cause_recoded'])
floods['rgb'] = floods['cause_recoded'].apply(lambda x: color_lookup[x])

scatter = pydeck.Layer(
    'ScatterplotLayer',
    data=floods,
    get_position='[lng, lat]',
    get_radius='30000 * severity',
    get_fill_color='rgb',
    pickable=True
)
geojson = pydeck.Layer(
    'GeoJsonLayer',
    data='https://datahub.io/core/geo-countries/r/countries.geojson',
    line_width_min_pixels=1,
    get_radius=10000,
    stroked=True,
    extruded=False,
    filled=True,
    get_line_color=[255, 255, 255, 255]
)
pydeck.Deck(layers=[geojson, scatter], tooltip=True).to_html()

### Bar charts

In [None]:
import matplotlib.pyplot as plt

# Relative comparisons
sp = sns.barplot(
    x='cause_recoded',
    y='log10_displaced',
    data=floods)

sns.set(font_scale=2)
sp.set_xticklabels(sp.get_xticklabels(), rotation=30)
sp.set(
    xlabel='Cause',
    ylabel='Log of # of People Displaced',
    title='Relative Distributions of Flood Causes')

In [None]:
# For ease of use, recode the activity period (year followed by month) as a 'YYYY-MM-01' date string
sfo['datetime'] = sfo['Activity Period'].apply(lambda x: str(x)[:4] + '-' + str(x)[4:] + '-01')
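
Here is a small sketch of what that recoding does to a single value. It assumes the activity period is stored as a year followed by a two-digit month (for example `200507`), which is what the slicing above implies; the sample value itself is hypothetical.

```python
from datetime import datetime

period = 200507  # hypothetical value in YYYYMM form

# String slicing, as in the cell above
as_text = str(period)[:4] + '-' + str(period)[4:] + '-01'

# Alternative: parse into a real date object with the standard library
as_date = datetime.strptime(str(period), '%Y%m')

print(as_text)
print(as_date.date())
```

The string version is enough for labeling an axis. A real date object is more convenient if you later need date arithmetic or proper time-axis spacing.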

In [None]:
# Time series
sns.set(style="whitegrid")
# Select the column before summing so that only passenger counts are aggregated
df = sfo.groupby('datetime')['Passenger Count'].sum()
df = df.reset_index()
sp = sns.lineplot(
    x='datetime',
    y='Passenger Count',
    data=df,
    linewidth=2)
labels = [x if x.endswith('-12-01') or x.endswith('-06-01') else '' for x in df['datetime']]
sp.set_xticklabels(labels, rotation=30)
sp

In [None]:
# Total passengers per month and price category
df = sfo.groupby(['datetime', 'Price Category Code'])['Passenger Count'].sum()
df = df.reset_index()
# Pivot to wide form: one row per month, one column per price category
pivoted = pd.pivot_table(
    data=df,
    index='datetime',
    columns='Price Category Code',
    values='Passenger Count',
    aggfunc='sum')
# With wide-form data each column becomes its own line, so no hue argument is needed
c = sns.lineplot(data=pivoted, palette="tab10", linewidth=2.5)
c.set_xticklabels(c.get_xticklabels(), rotation=30)
c
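
The pivot step above reshapes the table from long form (one row per month and category) to wide form (one row per month, one column per category). A minimal sketch on made-up numbers shows the change in shape; the category labels and counts are hypothetical.

```python
import pandas as pd

# Hypothetical long-form data: one row per (month, category) pair
long_form = pd.DataFrame({
    'datetime': ['2005-07-01', '2005-07-01', '2005-08-01', '2005-08-01'],
    'Price Category Code': ['Low Fare', 'Other', 'Low Fare', 'Other'],
    'Passenger Count': [100, 400, 120, 380],
})

# Wide form: months become the index, categories become columns
wide_form = pd.pivot_table(
    data=long_form,
    index='datetime',
    columns='Price Category Code',
    values='Passenger Count',
    aggfunc='sum')
print(wide_form)
```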

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
# Heatmap from Seaborn docs

# Load the example flights dataset (long-form) and pivot it to wide-form
flights_long = sns.load_dataset("flights")
# Keyword arguments are required by recent pandas versions
flights = flights_long.pivot(index="month", columns="year", values="passengers")

# Draw a heatmap with the numeric values in each cell
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5, ax=ax)

In [None]:
# Subtle errors:

sns.lineplot(data=floods_over_time.set_index('began'))

In [None]:
line_plot = sns.lmplot(
    data=sfo,
    x='Activity Period',
    y='Passenger Count',
    hue='GEO Region',
    aspect=3)

In [None]:
# Your turn
# Can you generate a chart of the passenger count by GEO Region?