# Lab 5

# Data Manipulation with Pandas

In this lab, you'll be working through Chapter 3 to get an introduction to the data manipulation and analysis package for Python, Pandas. This notebook is made up of two sections.

- Section 1: Work through the code samples in Chapter 3
- Section 2: Exercises

# Section 1: Code Practice

In this section, you will be reading through the various chapter sections and **typing out**/running the code samples given in the sections. The purpose of this is for you to practice using Jupyter to run Python code as well as learn about the functionality available to you in both IPython and Jupyter.

**Do not copy/paste the code**. Type it out. Don't go zen, either. Pay attention to the meaning of what you are typing. Pay attention to the parameters and the types of arguments. Find the similarities and differences among the various object APIs. 

## The hardest part of Pandas is the *massive* API.

The only way to become proficient is to **actually, physically, viscerally** use it. Repeatedly and deliberately over time.

---

##### Executing code in Jupyter

When typing and executing code in Jupyter, it is helpful to know the various keyboard shortcuts. You can find the full list by clicking **Help &rarr; Keyboard Shortcuts** in the menu. The three most useful shortcuts are:

- `Shift-Enter`: Execute the current cell and advance to the next cell. If no cell exists below the current one, a new cell is created; otherwise a new cell is **not** created.
- `Alt-Enter`: Execute the current cell and **create** a new cell below.
- `Control-Enter`: Execute the current cell without advancing to the next cell.

When writing your code, you will use these commands to keep the input/output (`In`/`Out`) consistent with what is found in the chapter. If you create a cell by mistake, you can always go to **Edit &rarr; Delete Cells** to remove it.

#### Purpose of Section 1

Your purpose in this section is to:

- **Type out** the code examples from the chapter (do not copy and paste)
- **Run** them
- **Check** to **make sure** you are getting the same results as what is contained in the chapter

---




## Introducing Pandas Objects

[Chapter/Section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.01-Introducing-Pandas-Objects.ipynb)

### The Pandas Series Object

In [None]:
import numpy as np
import pandas as pd

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

In [None]:
data.values

In [None]:
data[1]

In [None]:
data[1:3]

In [None]:
data.index

#### `Series` as generalized NumPy array

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data['b']

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

In [None]:
data[5]

#### `Series` as specialized dictionary

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
population['California']

In [None]:
population['California':'Illinois']

#### Constructing `Series` objects

In [None]:
pd.Series([2, 4, 6])

In [None]:
pd.Series(5, index=[100, 200, 300])

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'})

In [None]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

### The Pandas `DataFrame` Object

#### `DataFrame` as a generalized NumPy array

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

In [None]:
states.index

In [None]:
states.columns

#### `DataFrame` as specialized dictionary

In [None]:
states['area']

#### Constructing `DataFrame` objects


In [None]:
pd.DataFrame(population, columns=['population'])

In [None]:
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)

In [None]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

In [None]:
pd.DataFrame({'population': population,
              'area': area})

In [None]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

In [None]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A


In [None]:
pd.DataFrame(A)

### The Pandas `Index` Object

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

#### Index as immutable array

In [None]:
ind[1]

In [None]:
ind[::2]

In [None]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

In [None]:
ind[1] = 0


#### Index as ordered set

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA & indB  # intersection

In [None]:
indA | indB  # union

In [None]:
indA ^ indB  # symmetric difference
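
In recent pandas releases (2.0 and later, as far as I know), the `&`, `|`, and `^` operators on an `Index` act element-wise rather than as set operations, so the three cells above may not reproduce the chapter's output. A minimal sketch of the named-method equivalents, which should behave the same on older and newer versions:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # values present in both
either = indA.union(indB)                    # values present in either
only_one = indA.symmetric_difference(indB)   # values present in exactly one

print(common)
print(either)
print(only_one)
```

The same methods apply to any `Index`, including the string-labelled ones used later in the chapter.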

---

## Data Indexing and Selection

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.02-Data-Indexing-and-Selection.ipynb)

### Data Selection in `Series`

#### `Series` as dictionary

In [None]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data


In [None]:
data['b']

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
list(data.items())

In [None]:
data['e'] = 1.25
data

#### `Series` as one-dimensional array

In [None]:
# slicing by explicit index
data['a':'c']

In [None]:
# slicing by implicit integer index
data[0:2]

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['a', 'e']]

#### Indexers: `loc`, `iloc`, and `ix`

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
# explicit index when indexing
data[1]

In [None]:
# implicit index when slicing
data[1:3]

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

### Data Selection in `DataFrame`

#### `DataFrame` as a dictionary

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

In [None]:
data['area']

In [None]:
data.area

In [None]:
data.area is data['area']

In [None]:
data['density'] = data['pop'] / data['area']
data

In [None]:
data.pop is data['pop']

#### `DataFrame` as two-dimensional array

In [None]:
data.values

In [None]:
data.T

In [None]:
data.values[0]

In [None]:
data['area']

In [None]:
data.iloc[:3, :2]

In [None]:
data.loc[:'Illinois', :'pop']

In [None]:
data.loc[data.density > 100, ['pop', 'density']]

In [None]:
data.iloc[0, 2] = 90
data


#### Additional indexing conventions

In [None]:
data['Florida':'Illinois']

In [None]:
data[1:3]

In [None]:
data[data.density > 100]

---

## Operating on Data in Pandas

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.03-Operations-in-Pandas.ipynb)

### UFuncs: Index Preservation

In [None]:
import pandas as pd
import numpy as np

In [None]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser

In [None]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

In [None]:
np.exp(ser)

In [None]:
np.sin(df * np.pi / 4)

### UFuncs: Index Alignment

#### Index alignment in `Series`

In [None]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')

In [None]:
population / area

In [None]:
area.index.union(population.index)

In [None]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

In [None]:
A.add(B, fill_value=0)

#### Index alignment in `DataFrame`

In [None]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

In [None]:
B = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                 columns=list('BAC'))
B

In [None]:
A + B

In [None]:
fill = A.stack().mean()
A.add(B, fill_value=fill)

### UFuncs: Operations Between `DataFrame` and `Series`

In [None]:
A = rng.randint(10, size=(3, 4))
A

In [None]:
A - A[0]

In [None]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

In [None]:
df.subtract(df['R'], axis=0)

In [None]:
halfrow = df.iloc[0, ::2]
halfrow

In [None]:
df - halfrow

---

## Handling Missing Data

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.04-Missing-Values.ipynb)

### `None`: Pythonic missing data

In [None]:
import numpy as np
import pandas as pd

In [None]:
vals1 = np.array([1, None, 3, 4])
vals1

In [None]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

In [None]:
vals1.sum()

### `NaN`: Missing numerical data

In [None]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

In [None]:
1 + np.nan

In [None]:
0 *  np.nan

In [None]:
vals2.sum(), vals2.min(), vals2.max()

In [None]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2)

### `NaN` and `None` in Pandas

In [None]:
pd.Series([1, np.nan, 2, None])

In [None]:
x = pd.Series(range(2), dtype=int)
x

In [None]:
x[0] = None
x

### Operating on Null Values

#### Detecting null values

In [None]:
data = pd.Series([1, np.nan, 'hello', None])

In [None]:
data.isnull()

In [None]:
data[data.notnull()]

#### Dropping null values

In [None]:
data.dropna()

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df


In [None]:
df.dropna()

In [None]:
df.dropna(axis='columns')

In [None]:
df[3] = np.nan
df

In [None]:
df.dropna(axis='rows', thresh=3)

In [None]:
df.dropna(axis='columns', how='all')

#### Filling null values

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

In [None]:
data.fillna(0)

In [None]:
# forward-fill
data.fillna(method='ffill')

In [None]:
# back-fill
data.fillna(method='bfill')

In [None]:
df

In [None]:
df.fillna(method='ffill', axis=1)
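
Recent pandas releases (2.1 and later, as far as I know) warn that `fillna(method=...)` is deprecated. A sketch using the dedicated `ffill()` and `bfill()` methods, which I would expect to give the same results as the cells above; the data is rebuilt here only so the block runs on its own:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()    # carry the previous valid value forward
backward = data.bfill()   # pull the next valid value backward

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan
across = df.ffill(axis=1)  # fill along each row, left to right

print(forward)
print(backward)
print(across)
```

Note that a leading null has nothing before it to copy from, so it stays null under a forward fill.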

---

## Hierarchical Indexing

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.05-Hierarchical-Indexing.ipynb)

In [None]:
import pandas as pd
import numpy as np

### A Multiply Indexed `Series`

#### The bad way

In [None]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

In [None]:
pop[('California', 2010):('Texas', 2000)]

In [None]:
pop[[i for i in pop.index if i[1] == 2010]]

### The Better Way: Pandas `MultiIndex`

In [None]:
index = pd.MultiIndex.from_tuples(index)
index

In [None]:
pop = pop.reindex(index)
pop

In [None]:
pop[:, 2010]


#### `MultiIndex` as extra dimension

In [None]:
pop_df = pop.unstack()
pop_df


In [None]:
pop_df.stack()

In [None]:
pop_df = pd.DataFrame({'total': pop,
                       'under18': [9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
pop_df

In [None]:
f_u18 = pop_df['under18'] / pop_df['total']
f_u18.unstack()

### Methods of `MultiIndex` Creation

In [None]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

In [None]:
data = {('California', 2000): 33871648,
        ('California', 2010): 37253956,
        ('Texas', 2000): 20851820,
        ('Texas', 2010): 25145561,
        ('New York', 2000): 18976457,
        ('New York', 2010): 19378102}
pd.Series(data)

#### Explicit `MultiIndex` constructors

In [None]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

In [None]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

In [None]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

In [None]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

#### `MultiIndex` level names

In [None]:
pop.index.names = ['state', 'year']
pop

#### `MultiIndex` for columns

In [None]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

In [None]:
health_data['Guido']

### Indexing and Slicing a `MultiIndex`

#### Multiply indexed `Series`

In [None]:
pop

In [None]:
pop['California', 2000]

In [None]:
pop['California']

In [None]:
pop.loc['California':'New York']

In [None]:
pop[:, 2000]

In [None]:
pop[pop > 22000000]

In [None]:
pop[['California', 'Texas']]

#### Multiply indexed `DataFrame`s

In [None]:
health_data['Guido', 'HR']

In [None]:
health_data.iloc[:2, :2]

In [None]:
health_data.loc[:, ('Bob', 'HR')]

In [None]:
health_data.loc[(:, 1), (:, 'HR')]

In [None]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]


In [None]:
health_data

### Rearranging Multi-Indices

#### Sorted and unsorted indices

In [None]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

In [None]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)


In [None]:
data = data.sort_index()
data

In [None]:
data['a':'b']

#### Stacking and unstacking indices

In [None]:
pop.unstack(level=0)

In [None]:
pop.unstack(level=1)


In [None]:
pop.unstack().stack()

#### Index setting and resetting

In [None]:
pop_flat = pop.reset_index(name='population')
pop_flat


In [None]:
pop_flat.set_index(['state', 'year'])

### Data Aggregations on Multi-Indices

In [None]:
health_data

In [None]:
data_mean = health_data.mean(level='year')
data_mean

In [None]:
data_mean.mean(axis=1, level='type')
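
The `level` argument to `mean()` used in the two cells above is no longer accepted in recent pandas releases (2.0 and later, as far as I know). A minimal sketch of the same aggregations written with `groupby(level=...)`; the `values` array is a deterministic stand-in for the chapter's random mock data, so the numbers will differ from the book:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# stand-in values (assumption): 0..23 laid out row by row
values = np.arange(24, dtype=float).reshape(4, 6)
health_data = pd.DataFrame(values, index=index, columns=columns)

# average over visits within each year (rows grouped by the 'year' level)
data_mean = health_data.groupby(level='year').mean()

# average over subjects within each measurement type (column level 'type');
# transposing first avoids relying on column-wise groupby
type_mean = data_mean.T.groupby(level='type').mean().T

print(data_mean)
print(type_mean)
```

Transposing twice is a design choice made for portability here; other spellings may be shorter on a specific pandas version.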

---

## Combining Datasets: Concat and Append

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.06-Concat-And-Append.ipynb)

In [None]:
import pandas as pd
import numpy as np

In [None]:
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))

In [None]:
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
    

### Recall: Concatenation of NumPy Arrays

In [None]:
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

In [None]:
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)

### Simple Concatenation with `pd.concat`

In [None]:
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

In [None]:
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')

In [None]:
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")

#### Duplicate indices

In [None]:
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')

In [None]:
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

In [None]:
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')

In [None]:
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")

#### Concatenation with joins

In [None]:
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')

In [None]:
display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")

In [None]:
display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")

#### The `append()` method

In [None]:
display('df1', 'df2', 'df1.append(df2)')
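
Two calls in this section depend on features that later pandas releases dropped, as far as I am aware: the `join_axes` argument of `pd.concat` (removed in 1.0) and the `DataFrame.append()` method (removed in 2.0). A minimal sketch of equivalents that should work on both older and newer versions; `make_df` is repeated only so the block runs on its own:

```python
import pandas as pd

def make_df(cols, ind):
    """Quickly make a DataFrame (same helper as earlier in this section)."""
    data = {c: [str(c) + str(i) for i in ind] for c in cols}
    return pd.DataFrame(data, ind)

df1, df2 = make_df('AB', [1, 2]), make_df('AB', [3, 4])
df5, df6 = make_df('ABC', [1, 2]), make_df('BCD', [3, 4])

# stand-in for join_axes=[df5.columns]: concatenate, then keep df5's columns
kept_cols = pd.concat([df5, df6]).reindex(columns=df5.columns)

# stand-in for df1.append(df2): returns a new object, inputs are unchanged
appended = pd.concat([df1, df2])

print(kept_cols)
print(appended)
```

Both replacements return new objects, which matches the behavior the chapter describes for `append()`.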

---

## Combining Datasets: Merge and Join

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.07-Merge-and-Join.ipynb)

In [None]:
import pandas as pd
import numpy as np

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

### Categories of Joins

#### One-to-one joins

In [None]:
df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})
display('df1', 'df2')

In [None]:
df3 = pd.merge(df1, df2)
df3

#### Many-to-one joins

In [None]:
df4 = pd.DataFrame({'group': ['Accounting', 'Engineering', 'HR'],
                    'supervisor': ['Carly', 'Guido', 'Steve']})
display('df3', 'df4', 'pd.merge(df3, df4)')

#### Many-to-many joins

In [None]:
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
                              'Engineering', 'Engineering', 'HR', 'HR'],
                    'skills': ['math', 'spreadsheets', 'coding', 'linux',
                               'spreadsheets', 'organization']})
display('df1', 'df5', "pd.merge(df1, df5)")

### Specification of the Merge Key

#### The `on` keyword

In [None]:
display('df1', 'df2', "pd.merge(df1, df2, on='employee')")

#### The `left_on` and `right_on` keywords

In [None]:
df3 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'salary': [70000, 80000, 120000, 90000]})
display('df1', 'df3', 'pd.merge(df1, df3, left_on="employee", right_on="name")')

In [None]:
pd.merge(df1, df3, left_on="employee", right_on="name").drop('name', axis=1)

#### The `left_index` and `right_index` keywords

In [None]:
df1a = df1.set_index('employee')
df2a = df2.set_index('employee')
display('df1a', 'df2a')

In [None]:
display('df1a', 'df2a',
        "pd.merge(df1a, df2a, left_index=True, right_index=True)")

In [None]:
display('df1a', 'df2a', 'df1a.join(df2a)')

In [None]:
display('df1a', 'df3', "pd.merge(df1a, df3, left_index=True, right_on='name')")

### Specifying Set Arithmetic for Joins

In [None]:
df6 = pd.DataFrame({'name': ['Peter', 'Paul', 'Mary'],
                    'food': ['fish', 'beans', 'bread']},
                   columns=['name', 'food'])
df7 = pd.DataFrame({'name': ['Mary', 'Joseph'],
                    'drink': ['wine', 'beer']},
                   columns=['name', 'drink'])
display('df6', 'df7', 'pd.merge(df6, df7)')

In [None]:
pd.merge(df6, df7, how='inner')

In [None]:
display('df6', 'df7', "pd.merge(df6, df7, how='outer')")

In [None]:
display('df6', 'df7', "pd.merge(df6, df7, how='left')")

### Overlapping Column Names: The `suffixes` Keyword

In [None]:
df8 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [1, 2, 3, 4]})
df9 = pd.DataFrame({'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'rank': [3, 1, 4, 2]})
display('df8', 'df9', 'pd.merge(df8, df9, on="name")')

In [None]:
display('df8', 'df9', 'pd.merge(df8, df9, on="name", suffixes=["_L", "_R"])')

### Example: US States Data

`curl` will not work on Windows machines. Execute the following cell instead.

In [None]:
import pandas as pd
pop = pd.read_csv('https://belhavencs.nyc3.digitaloceanspaces.com/csc311/state-population.csv')
areas = pd.read_csv('https://belhavencs.nyc3.digitaloceanspaces.com/csc311/state-areas.csv')
abbrevs = pd.read_csv('https://belhavencs.nyc3.digitaloceanspaces.com/csc311/state-abbrevs.csv')

display('pop.head()', 'areas.head()', 'abbrevs.head()')

In [None]:
# Following are shell commands to download the data
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-population.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master/state-areas.csv
# !curl -O https://raw.githubusercontent.com/jakevdp/data-USstates/master

In [None]:
pop = pd.read_csv('data/state-population.csv')
areas = pd.read_csv('data/state-areas.csv')
abbrevs = pd.read_csv('data/state-abbrevs.csv')

display('pop.head()', 'areas.head()', 'abbrevs.head()')

In [None]:
merged = pd.merge(pop, abbrevs, how='outer',
                  left_on='state/region', right_on='abbreviation')
merged = merged.drop('abbreviation', axis=1) # drop duplicate info
merged.head()

In [None]:
merged.isnull().any()

In [None]:
merged[merged['population'].isnull()].head()

In [None]:
merged.loc[merged['state'].isnull(), 'state/region'].unique()

In [None]:
merged.loc[merged['state/region'] == 'PR', 'state'] = 'Puerto Rico'
merged.loc[merged['state/region'] == 'USA', 'state'] = 'United States'
merged.isnull().any()

In [None]:
final = pd.merge(merged, areas, on='state', how='left')
final.head()

In [None]:
final.isnull().any()


In [None]:
final['state'][final['area (sq. mi)'].isnull()].unique()

In [None]:
final.dropna(inplace=True)
final.head()

In [None]:
data2010 = final.query("year == 2010 & ages == 'total'")
data2010.head()

In [None]:
data2010.set_index('state', inplace=True)
density = data2010['population'] / data2010['area (sq. mi)']

---

## Aggregation and Grouping

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.08-Aggregation-and-Grouping.ipynb)

In [None]:
import numpy as np
import pandas as pd

class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args
        
    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)
    
    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)

### Planets Data

In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape


In [None]:
planets.head()

### Simple Aggregation in Pandas

In [None]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser

In [None]:
ser.sum()

In [None]:
ser.mean()

In [None]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df


In [None]:
df.mean()

In [None]:
df.mean(axis='columns')

In [None]:
planets.dropna().describe()

### `GroupBy`: Split, Apply, Combine

#### Split, apply, combine

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

In [None]:
df.groupby('key')

In [None]:
df.groupby('key').sum()

#### The `GroupBy` object

In [None]:
planets.groupby('method')

In [None]:
planets.groupby('method')['orbital_period'].median()

In [None]:
for (method, group) in planets.groupby('method'):
    print("{0:30s} shape={1}".format(method, group.shape))

In [None]:
planets.groupby('method')['year'].describe().unstack()

#### Aggregate, filter, transform, apply

In [None]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df


In [None]:
df.groupby('key').aggregate(['min', np.median, max])

In [None]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

In [None]:
def filter_func(x):
    return x['data2'].std() > 4

display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")

In [None]:
df.groupby('key').transform(lambda x: x - x.mean())

In [None]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

display('df', "df.groupby('key').apply(norm_by_data2)")

#### Specifying the split key

In [None]:
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')

In [None]:
display('df', "df.groupby(df['key']).sum()")

In [None]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')

In [None]:
display('df2', 'df2.groupby(str.lower).mean()')

In [None]:
df2.groupby([str.lower, mapping]).mean()

#### Grouping example

In [None]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)

---

## Pivot Tables

[Chapter/section link](https://nbviewer.jupyter.org/urls/bitbucket.org/dogwynn/pythondatasciencehandbook/raw/master/notebooks/03.09-Pivot-Tables.ipynb)

### Motivating Pivot Tables

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')

In [None]:
titanic.head()

### Pivot Tables by Hand

In [None]:
titanic.groupby('sex')[['survived']].mean()

In [None]:
titanic.groupby(['sex', 'class'])['survived'].aggregate('mean').unstack()

### Pivot Table Syntax

In [None]:
titanic.pivot_table('survived', index='sex', columns='class')

#### Multi-level pivot tables

In [None]:
age = pd.cut(titanic['age'], [0, 18, 80])
titanic.pivot_table('survived', ['sex', age], 'class')

In [None]:
fare = pd.qcut(titanic['fare'], 2)
titanic.pivot_table('survived', ['sex', age], [fare, 'class'])

#### Additional pivot table options

In [None]:
titanic.pivot_table(index='sex', columns='class',
                    aggfunc={'survived':sum, 'fare':'mean'})

In [None]:
titanic.pivot_table('survived', index='sex', columns='class', margins=True)

### Example: Birthrate Data

The following cell replaces `In [10]` and `In [11]`.

In [None]:
births = pd.read_csv('https://belhavencs.nyc3.digitaloceanspaces.com/csc311/births.csv')

In [None]:
births = pd.read_csv('data/births.csv')

In [None]:
births['decade'] = 10 * (births['year'] // 10)
births.pivot_table('births', index='decade', columns='gender', aggfunc=

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
sns.set()  # use Seaborn styles
births.pivot_table('births', index='year', columns='gender', aggfunc='sum').plot()
plt.ylabel('total births per year');

#### Further data exploration

In [None]:
quartiles = np.percentile(births['births'], [25, 50, 75])
mu = quartiles[1]
sig = 0.74 * (quartiles[2] - quartiles[0])

In [None]:
births = births.query('(births > @mu - 5 * @sig) & (births < @mu + 5 * @sig)')
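
The two cells above amount to a simple sigma-clipping step. The factor 0.74 reflects that, for Gaussian-distributed data, the interquartile range is roughly 1.35 standard deviations, so 0.74 times the IQR gives an outlier-resistant estimate of the standard deviation. A minimal sketch on synthetic data (the `counts` frame and its values are made up for illustration, and a boolean mask stands in for `query`):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# hypothetical stand-in for the births column: Gaussian values plus two outliers
counts = pd.DataFrame({'births': np.concatenate([rng.normal(5000, 100, 1000),
                                                 [1.0, 99999.0]])})

quartiles = np.percentile(counts['births'], [25, 50, 75])
mu = quartiles[1]                            # median as the centre
sig = 0.74 * (quartiles[2] - quartiles[0])   # IQR-based estimate of sigma

# keep only rows within 5 robust sigmas of the median
mask = (counts['births'] > mu - 5 * sig) & (counts['births'] < mu + 5 * sig)
clipped = counts[mask]

print(len(counts), len(clipped))
```

Because the median and IQR barely move when a few extreme values are present, the two injected outliers do not inflate the threshold the way a plain mean and standard deviation would.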

In [None]:
# set 'day' column to integer; it originally was a string due to nulls
births['day'] = births['day'].astype(int)

In [None]:
# create a datetime index from the year, month, day
births.index = pd.to_datetime(10000 * births.year +
                              100 * births.month +
                              births.day, format='%Y%m%d')

births['dayofweek'] = births.index.dayofweek

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl

births.pivot_table('births', index='dayofweek',
                    columns='decade', aggfunc='mean').plot()
plt.gca().set_xticklabels(['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'])
plt.ylabel('mean births by day');

In [None]:
births_by_date = births.pivot_table('births', 
                                    [births.index.month, births.index.day])
births_by_date.head()

In [None]:
births_by_date.index = [pd.Timestamp(2012, month, day)
                        for (month, day) in births_by_date.index]
births_by_date.head()

In [None]:
# Plot the results
fig, ax = plt.subplots(figsize=(12, 4))
births_by_date.plot(ax=ax);

---

# Section 2: Exercises

In this section, you will be provided a few exercises to demonstrate your understanding of the chapter contents. Each exercise will have a Markdown section describing the problem, and you will provide cells below the description with code, comments and visual demonstrations of your solution.

---

### Problem 1



Use the `seaborn.load_dataset` function to load the `"titanic"` dataset.

```python
import seaborn
titanic = seaborn.load_dataset('titanic')
```

Using this dataset and the capabilities provided by Pandas (i.e., use `DataFrame` and `Series` objects and their methods, not NumPy arrays), answer the following questions from Lab 3:

- What are the minimum, maximum, and mean ages of the following types of passengers on the Titanic?
    - All passengers
    - Survivors 
    - Those that died
- What is the percentage of male passengers that died?
- What is the percentage of female passengers that died?
- What is the average age of men that survived?
- What is the average age of women that survived?
- What is the [mode](https://www.mathsisfun.com/definitions/mode.html) of the class of survivors?
- What is the mode of the class of those that died?
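
The questions above mostly reduce to boolean masks, `groupby`, and a handful of aggregation methods. A rough sketch of those building blocks on a hypothetical six-row frame (`mini` and its values are invented; only the column names mirror the real dataset), so it shows the shape of the calls rather than the answers:

```python
import pandas as pd

# hypothetical miniature stand-in with the same column names as the real dataset
mini = pd.DataFrame({
    'survived': [1, 0, 1, 0, 1, 0],
    'sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'age': [30.0, 40.0, 20.0, 50.0, 10.0, 60.0],
    'class': ['First', 'Third', 'Second', 'Third', 'First', 'Third'],
})

# min, max, and mean age, split by the survived flag (1 = survived, 0 = died)
age_stats = mini.groupby('survived')['age'].agg(['min', 'max', 'mean'])

# fraction of each sex that died: the mean of a boolean mask within each group
died_frac = (mini['survived'] == 0).groupby(mini['sex']).mean()

# most common class among survivors; mode() returns a Series because ties are possible
survivor_class_mode = mini.loc[mini['survived'] == 1, 'class'].mode()

print(age_stats)
print(died_frac)
print(survivor_class_mode)
```

Multiplying `died_frac` by 100 turns the fractions into percentages.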

---

### Problem 2

Type the following in a cell and run it:

```python
import os
from pathlib import Path
def get_names():
    if not Path('names.csv').exists():
        names = pd.read_csv('https://belhavencs.nyc3.digitaloceanspaces.com/csc311/names.csv')
        names.to_csv('names.csv', index=None)
    else:
        names = pd.read_csv('names.csv')
    return names

names = get_names()
names.head()
```

The `names` DataFrame is a database of first names of children born since 1880. It has the following columns:

Column | Description
:-----:|:-----------
**name** | First name given 
**gender** | Gender of the children with the name
**births** | The number of children born with the name 
**year** | The year of birth 

Use the `names` DataFrame to answer the following questions using the Pandas API:

- How many people were born in 1947?
- How many boys were born in 1947?
- Using the `groupby` method of `names`, split the data by `year`, apply a `sum`, and assign the output of this operation to a new variable, `births_by_year`. This provides you with a `DataFrame`. Answer the following questions about the `DataFrame` object.
    - What is its index?
    - What are its columns?
    - What do its values mean?
    - Plot the `DataFrame`. I.e. call the `DataFrame`'s `plot()` method.

      ```python
      %matplotlib inline
      data.plot()
      ```
- Using the methods from the previous step, show the usage over time of your name.

In [None]:
import os
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt

def get_names():
    if not Path('names.csv').exists():
        names = pd.read_csv('https://belhavencs.nyc3.digitaloceanspaces.com/csc311/names.csv')
        names.to_csv('names.csv', index=None)
    else:
        names = pd.read_csv('names.csv')
    return names

# Load the names DataFrame
names = get_names()

# Display the first few rows of the DataFrame
print(names.head())

# Question 1: How many people were born in 1947?
total_born_1947 = names[names['year'] == 1947]['births'].sum()
print(f"Total people born in 1947: {total_born_1947}")

# Question 2: How many boys were born in 1947?
boys_born_1947 = names[(names['year'] == 1947) & (names['gender'] == 'boy')]['births'].sum()
print(f"Total boys born in 1947: {boys_born_1947}")

# Group by year and sum births
births_by_year = names.groupby('year')['births'].sum().reset_index()

# Display the index of births_by_year
print(f"Index of births_by_year: {births_by_year.index}")

# Display the columns of births_by_year
print(f"Columns of births_by_year: {births_by_year.columns.tolist()}")

# Display the values meaning
print("The values represent the total number of births for each year.")

# Plot the DataFrame; figsize goes to plot() because a separate plt.figure() call would leave an empty figure
births_by_year.plot(x='year', y='births', kind='line', legend=False, figsize=(10, 6))
plt.title('Number of Births Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Births')
plt.grid()
plt.show()

# To show the usage over time of a specific name
name_to_plot = 'Michael'  # Example name, change as needed
name_usage = names[names['name'] == name_to_plot].groupby('year')['births'].sum().reset_index()

plt.figure(figsize=(10, 6))
plt.plot(name_usage['year'], name_usage['births'])
plt.title(f'Usage of the Name {name_to_plot} Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Births')
plt.grid()
plt.show()


---

### Problem 3

Use the `seaborn.load_dataset` function to load the `"tips"` dataset. This dataset contains tipping information for 244 different meals, where each row in the data is a specific meal. Meal data contains the following columns:

Column | Description
:-----:|:-----------
**total_bill** | cost of the meal
**tip** | amount of the tip
**sex** | sex of the person paying (`"Male"` or `"Female"`)
**smoker** | whether the person paying was a smoker
**day** | day of the week of the meal (`"Thur"`, `"Fri"`, `"Sat"`, `"Sun"`)
**time** | `"Lunch"` or `"Dinner"`
**size** | number of people at table

For each of these questions, provide a pivot table that demonstrates your answer.

- Are females better tippers than males? 
- Are females better tippers than males, regardless of whether it is lunch or dinner?
- Are smokers better tippers than non-smokers?
- Are smokers better tippers than non-smokers, regardless of sex?

In [None]:
import seaborn as sns

# Load the tips dataset described above
tips = sns.load_dataset('tips')

# Pivot tables of the mean tip for each question
pivot_female_male = tips.pivot_table(values='tip', index='sex', aggfunc='mean')
pivot_female_male_meal = tips.pivot_table(values='tip', index='sex', columns='time', aggfunc='mean')
pivot_smoker_non_smoker = tips.pivot_table(values='tip', index='smoker', aggfunc='mean')
pivot_smoker_non_smoker_sex = tips.pivot_table(values='tip', index='smoker', columns='sex', aggfunc='mean')

# Displaying pivot tables
print("Are females better tippers than males?")
print(pivot_female_male)

print("\nAre females better tippers than males, regardless of meal type?")
print(pivot_female_male_meal)

print("\nAre smokers better tippers than non-smokers?")
print(pivot_smoker_non_smoker)

print("\nAre smokers better tippers than non-smokers, regardless of sex?")
print(pivot_smoker_non_smoker_sex)
