# Adding and Removing Data

## About the Data
In this notebook, we will be working with earthquake data from September 18, 2018 through October 13, 2018, obtained from the US Geological Survey (USGS) using the [USGS API](https://earthquake.usgs.gov/fdsnws/event/1/).

## Setup
We will be working with the `data/earthquakes.csv` file again, so we need to handle our imports and read it in.

In [None]:
import pandas as pd

df = pd.read_csv(
    'data/earthquakes.csv', 
    usecols=['time', 'title', 'place', 'magType', 'mag', 'alert', 'tsunami']
)

## Creating new data
### Adding new columns
New columns get added to the right of the original columns and can be a single value, which will be **broadcast** along the rows of the dataframe:

In [None]:
df['source'] = 'USGS API'
df.head()

...or a Boolean mask:

In [None]:
df['mag_negative'] = df.mag < 0
df.head()
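
As a minimal sketch on a hypothetical three-row dataframe (the `toy` frame and its values are made up for illustration and are not taken from the earthquake data), the following shows both kinds of assignment and checks that the new columns land to the right of the existing ones:

```python
import pandas as pd

# hypothetical stand-in for the earthquake data
toy = pd.DataFrame({'mag': [1.2, -0.3, 4.5], 'place': ['A', 'B', 'C']})

toy['source'] = 'USGS API'         # a scalar is broadcast to every row
toy['mag_negative'] = toy.mag < 0  # a Boolean series, one value per row

print(toy.columns.tolist())
print(toy.mag_negative.sum())  # True counts as 1, so this counts negative magnitudes
```

Because the Boolean column is an ordinary series, it can be summed, counted, or used for filtering like any other column.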

#### Adding the `parsed_place` column
We have an entity recognition problem on our hands with the `place` column. There are several entities that have multiple names in the data (e.g., CA and California, NV and Nevada).

In [None]:
df.place.str.extract(r', (.*$)')[0].sort_values().unique()

Replace parts of the `place` names to fit our needs:

In [None]:
df['parsed_place'] = df.place.str.replace(
    r'.* of ', '', regex=True # remove anything saying <something> of <something>
).str.replace(
    'the ', '' # remove "the "
).str.replace(
    r'CA$', 'California', regex=True # fix California
).str.replace(
    r'NV$', 'Nevada', regex=True # fix Nevada
).str.replace(
    r'MX$', 'Mexico', regex=True # fix Mexico
).str.replace(
    r' region$', '', regex=True # chop off endings with " region"
).str.replace(
    'northern ', '' # remove "northern "
).str.replace(
    'Fiji Islands', 'Fiji' # line up the Fiji places
).str.replace(
    r'^.*, ', '', regex=True # remove anything else extraneous from the beginning
).str.strip() # remove any extra spaces

Now we can use a single name to get all earthquakes for that place (although this still isn't perfect):

In [None]:
df.parsed_place.sort_values().unique()
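
To see what the chain of replacements does to individual values, here is a small sketch that wraps the same steps in a helper and applies it to a few sample strings. The helper name `parse_place` and the sample strings are assumptions made for illustration; the strings are not guaranteed to appear in the earthquake data. The `regex` argument is spelled out on every call so the behavior doesn't depend on the pandas version's default:

```python
import pandas as pd

def parse_place(place: pd.Series) -> pd.Series:
    """Hypothetical helper applying the same replacements as the cell above."""
    return (
        place
        .str.replace(r'.* of ', '', regex=True)            # "<something> of "
        .str.replace('the ', '', regex=False)
        .str.replace(r'CA$', 'California', regex=True)
        .str.replace(r'NV$', 'Nevada', regex=True)
        .str.replace(r'MX$', 'Mexico', regex=True)
        .str.replace(r' region$', '', regex=True)
        .str.replace('northern ', '', regex=False)
        .str.replace('Fiji Islands', 'Fiji', regex=False)
        .str.replace(r'^.*, ', '', regex=True)             # keep the part after the last ", "
        .str.strip()
    )

# made-up sample values in the style of the `place` column
samples = pd.Series([
    '9km NE of Aguanga, CA',
    '25km NW of Ayna, Peru',
    'south of the Fiji Islands',
    'Kermadec Islands region',
    '8km E of Mammoth Lakes, California',
])
parsed = parse_place(samples)
print(parsed.tolist())
```

The first and last samples both end up as `California`, which is the point of the cleanup: one name per place.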

#### Using the `assign()` method to create columns
To create many columns at once or update existing columns, we can use `assign()`:

In [None]:
df.assign(
    in_ca=df.parsed_place.str.endswith('California'),
    in_alaska=df.parsed_place.str.endswith('Alaska')
).sample(5, random_state=0)

With the use of `lambda` functions, the `assign()` method becomes even more powerful. **Lambda functions** are anonymous functions usually defined in one line and for single use. The `assign()` method passes the entire dataframe into the `lambda` function as `x`; from there, we can select the columns `in_ca` and `in_alaska`, which are being created in that same call to `assign()`. Here, we use a `lambda` function to create a new column, `neither`, which tells if the earthquake was neither in Alaska nor California:

In [None]:
df.assign(
    in_ca=df.parsed_place == 'California',
    in_alaska=df.parsed_place == 'Alaska',
    neither=lambda x: ~x.in_ca & ~x.in_alaska
).sample(5, random_state=0)
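
The reason the `lambda` is needed can be shown on a hypothetical four-row dataframe (the `toy` frame is made up for illustration). The columns `in_ca` and `in_alaska` exist only on the dataframe that `assign()` is building, not on the original, so they can only be reached through the argument passed to the `lambda`:

```python
import pandas as pd

# hypothetical data standing in for df.parsed_place
toy = pd.DataFrame({'parsed_place': ['California', 'Alaska', 'Peru', 'Nevada']})

result = toy.assign(
    in_ca=toy.parsed_place == 'California',
    in_alaska=toy.parsed_place == 'Alaska',
    # x is the intermediate dataframe, which already holds in_ca and in_alaska
    neither=lambda x: ~x.in_ca & ~x.in_alaska
)
print(result)

# assign() returns a new dataframe; toy itself never gets an in_ca column,
# which is why toy.in_ca could not be used in place of the lambda
print('in_ca' in toy.columns)
```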

#### Concatenation
Say we were working with two separate dataframes, one with earthquakes accompanied by tsunamis and the other with earthquakes without tsunamis. If we wanted to look at earthquakes as a whole, we would want to concatenate the dataframes into a single one:

In [None]:
tsunami = df[df.tsunami == 1]
no_tsunami = df[df.tsunami == 0]

tsunami.shape, no_tsunami.shape

Concatenating along the row axis (`axis=0`) is equivalent to appending to the bottom. By concatenating our earthquakes with tsunamis and those without tsunamis, we get the full earthquake data set back:

In [None]:
pd.concat([tsunami, no_tsunami]).shape

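Note that in pandas versions before 2.0, the previous result was equivalent to running the `append()` method of the dataframe (`tsunami.append(no_tsunami)`). That method was deprecated in pandas 1.4 and removed in 2.0, so `pd.concat()` is the way to combine dataframes row-wise going forward. We can confirm that the concatenated result has the same shape as the original dataframe:

In [None]:
# DataFrame.append() no longer exists in pandas 2.0+; pd.concat() replaces it
pd.concat([tsunami, no_tsunami]).shape == df.shape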
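
One detail worth noting when concatenating along the rows is that each piece keeps its original index labels, so the combined rows are grouped by piece rather than in their original order. A minimal sketch on a hypothetical four-row dataframe (the `toy` frame and its values are made up for illustration):

```python
import pandas as pd

# hypothetical stand-in for the earthquake data
toy = pd.DataFrame({'mag': [1.0, 6.5, 2.2, 7.1], 'tsunami': [0, 1, 0, 1]})

with_tsunami = toy[toy.tsunami == 1]
without_tsunami = toy[toy.tsunami == 0]

combined = pd.concat([with_tsunami, without_tsunami])
print(combined.index.tolist())  # original labels, grouped by piece

# sorting by the index restores the original row order
print(combined.sort_index().equals(toy))
```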

We have been working with a subset of the columns from the CSV file, but suppose that now we want to get some of the columns we ignored when we read in the data. Since we have added new columns in this notebook, we won't want to read in the file and perform those operations again. Instead, we will concatenate along the columns (`axis=1`) to add back what we are missing:

In [None]:
additional_columns = pd.read_csv(
    'data/earthquakes.csv', usecols=['tz', 'felt', 'ids']
)
pd.concat([df.head(2), additional_columns.head(2)], axis=1)

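Notice what happens if the index doesn't align, though. Here, the second dataframe is indexed by `time`, so none of its labels match the default integer index of `df`, and the result contains the rows from both dataframes with `NaN` filling the gaps: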

In [None]:
additional_columns = pd.read_csv(
    'data/earthquakes.csv', usecols=['tz', 'felt', 'ids', 'time'], index_col='time'
)
pd.concat([df.head(2), additional_columns.head(2)], axis=1)

If the index doesn't align, we can align it before attempting the concatenation, which we will discuss in lab 3.
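
The effect of a misaligned index can be reproduced on two hypothetical two-row dataframes (the names `left` and `right` and all values are made up for illustration). The sketch also shows one possible way to align them, by moving the shared key into the index with `set_index()`; this is an assumption chosen for illustration and not necessarily the approach covered later:

```python
import pandas as pd

# left uses the default integer index; right is indexed by time
left = pd.DataFrame({'time': [100, 200], 'mag': [1.5, 2.0]})
right = pd.DataFrame(
    {'felt': [3.0, 7.0]}, index=pd.Index([100, 200], name='time')
)

# no labels in common, so we get the union of both indexes filled with NaN
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)

# with time as the index on both sides, the rows line up
aligned = pd.concat([left.set_index('time'), right], axis=1)
print(aligned)
```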

Say we want to join the `tsunami` and `no_tsunami` dataframes, but the `no_tsunami` dataframe has an additional column. The `join` parameter specifies how to handle any overlap in column names (when appending to the bottom) or in row names (when concatenating to the left/right). By default, this is `outer`, so we keep everything; however, if we use `inner`, we will only keep what is in common:

In [None]:
pd.concat(
    [tsunami.head(2), no_tsunami.head(2).assign(type='earthquake')], join='inner'
)

In addition, we use `ignore_index`, since the index doesn't mean anything for us here. This gives us sequential values instead of what we had in the previous result:

In [None]:
pd.concat(
    [tsunami.head(2), no_tsunami.head(2).assign(type='earthquake')], join='inner', ignore_index=True
)
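
The difference between the two `join` options, and the effect of `ignore_index`, can be compared side by side on two hypothetical dataframes (the names `a` and `b` and their values are made up for illustration):

```python
import pandas as pd

# b has an extra column (type) that a lacks
a = pd.DataFrame({'mag': [6.5, 7.1], 'place': ['Fiji', 'Indonesia']}, index=[3, 8])
b = pd.DataFrame(
    {'mag': [1.0, 2.2], 'place': ['Alaska', 'Nevada'], 'type': 'earthquake'},
    index=[0, 1]
)

outer = pd.concat([a, b])                # default: keep every column, fill gaps with NaN
inner = pd.concat([a, b], join='inner')  # keep only the columns both have
renumbered = pd.concat([a, b], join='inner', ignore_index=True)

print(outer.columns.tolist())
print(inner.columns.tolist())
print(inner.index.tolist(), renumbered.index.tolist())
```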

## Deleting Unwanted Data
Columns can be deleted using dictionary syntax with `del`:

In [None]:
del df['source']
df.columns

If we don't know whether the column exists, we should use a `try`/`except` block:

In [None]:
try:
    del df['source']
except KeyError:
    # handle the error here
    print('not there anymore')

We can also use `pop()`, which removes the column and returns it as a series, so we can keep using it afterward. Note that `pop()` raises a `KeyError` if the column doesn't exist, so a `try`/`except` can be used here as well:

In [None]:
mag_negative = df.pop('mag_negative')
df.columns

Notice we have a mask in `mag_negative` now:

In [None]:
mag_negative.value_counts()

Now, we can use `mag_negative` to filter our data:

In [None]:
df[mag_negative].head()
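
The `try`/`except` pattern mentioned for `pop()` can be sketched as follows. The helper name `pop_column` and the `toy` dataframe are hypothetical and exist only for this illustration; returning `None` for a missing column is one possible design choice among several:

```python
import pandas as pd

# hypothetical stand-in for the earthquake data
toy = pd.DataFrame({'mag': [1.2, -0.3, 4.5], 'mag_negative': [False, True, False]})

def pop_column(frame, name):
    """Hypothetical helper: pop a column if present, otherwise return None."""
    try:
        return frame.pop(name)
    except KeyError:
        return None

mask = pop_column(toy, 'mag_negative')     # removes the column and returns it
missing = pop_column(toy, 'mag_negative')  # the column is already gone

print(toy.columns.tolist())
print(toy[mask])  # the popped series still works as a filter
print(missing)
```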

### Using the `drop()` method
We can drop rows by passing a list of indices to the `drop()` method. Notice in the following example that when asking for the first 2 rows with `head()` we get the 3rd and 4th rows because we dropped the original first 2 with `drop([0, 1])`:

In [None]:
df.drop([0, 1]).head(2)
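
Keep in mind that `drop()` matches index labels, not positions. The two coincide in the example above only because `df` still has its default index. A minimal sketch on a hypothetical dataframe (the `toy` frame is made up for illustration) where they differ:

```python
import pandas as pd

toy = pd.DataFrame({'mag': [1.0, 6.5, 2.2, 7.1]})

# sort so that positions and index labels no longer agree
by_mag = toy.sort_values('mag', ascending=False)
print(by_mag.index.tolist())

# drop() removes the rows labeled 0 and 1, wherever they sit
remaining = by_mag.drop([0, 1])
print(remaining.index.tolist())
```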

The `drop()` method drops along the row axis by default. If we pass in a list of columns with the `columns` argument, we can delete columns:

In [None]:
cols_to_drop = [
    col for col in df.columns
    if col not in ['alert', 'mag', 'title', 'time', 'tsunami']
]
df.drop(columns=cols_to_drop).head()

We also have the option of using `axis=1`:

In [None]:
df.drop(columns=cols_to_drop).equals(
    df.drop(cols_to_drop, axis=1)
)

By default, `drop()`, along with the majority of `DataFrame` methods, will return a new `DataFrame` object. If we just want to change the one we are working with, we can pass `inplace=True`. This should be used with care, since the dropped data is gone from the dataframe and can only be recovered by rebuilding it:

In [None]:
df.drop(columns=cols_to_drop, inplace=True)
df.head()
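
As an alternative to `inplace=True`, the result of `drop()` can be assigned back to the same name, which keeps the change explicit. The sketch below uses a hypothetical two-row dataframe (the `toy` frame is made up for illustration) and also shows the `errors='ignore'` option of `drop()`, which skips labels that are not present instead of raising a `KeyError`:

```python
import pandas as pd

# hypothetical stand-in for the earthquake data
toy = pd.DataFrame({
    'mag': [1.0, 6.5], 'place': ['Alaska', 'Fiji'], 'alert': ['green', 'yellow']
})

trimmed = toy.drop(columns=['place'])  # new object; toy is untouched
print(toy.columns.tolist())
print(trimmed.columns.tolist())

# dropping a column that is already gone is a no-op with errors='ignore'
same = trimmed.drop(columns=['place'], errors='ignore')
print(same.equals(trimmed))

# reassigning updates what the name toy refers to, without inplace=True
toy = toy.drop(columns=['place'])
```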

<hr>

<div style="overflow: hidden; margin-bottom: 10px;">
    <div style="float: left;">
        <a href="./5-subsetting_data.ipynb">
            <button>&#8592; Previous Notebook</button>
        </a>
    </div>
    <div style="float: right;">
        <a href="../../solutions/ch_02/solutions.ipynb">
            <button>Solutions</button>
        </a>
        <a href="../lab_08/1-wide_vs_long.ipynb">
            <button>Lab 8 &#8594;</button>
        </a>
    </div>
</div>
<hr>