# Manipulating Data in Pandas

We won't cover every possible way to access data in this tutorial, but this should give you a sense of some of the main ways you can access and work with tabular data with Pandas. Things we'll cover:
- Selecting and accessing data from a DataFrame
- Filtering and reindexing data
- Transforming, sorting, aggregating, deduplicating

In [None]:
import pandas as pd
robocall_df = pd.read_csv("Data/Telemarketing_RoboCall_Weekly_Data_Transformed.csv")

We've loaded our data into a DataFrame which is much like a database table, or a single table in a spreadsheet. This table has rows and columns. Pandas has also added an index column which you'll see on the far left of the DataFrame. There is a LOT that a DataFrame can do - you can familiarize yourself with all it offers in the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). 

In [None]:
robocall_df

If you just want a list of the column names that's easy enough:

In [None]:
robocall_df.columns

To select a single column from the DataFrame you can use the name of the column within brackets:

In [None]:
robocall_df["phone_number"]

You can get multiple columns by specifying them in a list.

In [None]:
robocall_df[["phone_number", "type_telemarketing"]]

And if we want to get just one row of that column we can use a second set of brackets with the row index ("13" in the example below).

In [None]:
robocall_df["phone_number"][13]

In some cases you might want to change a piece of data, for instance in the process of cleaning it up. Any edits you make directly to the DataFrame will be reflected in the data. Verify in the output below that row index 13 has had its phone_number updated. In other cases you may want to replace many values at once, which can be done using the `.replace()` [function](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.replace.html).

In [None]:
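# Set the value with .loc[row_label, column]; chained indexing
# (df[column][row] = ...) may modify a temporary copy instead of the DataFrame.
robocall_df.loc[13, "phone_number"] = '404-608-4860'
robocall_df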
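
To make the `.replace()` idea concrete, here is a minimal sketch on a small made-up DataFrame. The column names mirror the tutorial data, but the values and the abbreviation mapping are assumptions invented for illustration.

```python
import pandas as pd

# Toy stand-in for the tutorial data; all values here are invented.
toy_df = pd.DataFrame({
    "phone_number": ["404-608-4860", "410-555-0100", "301-555-0100"],
    "state": ["GA", "MD", "MD"],
})

# Map each abbreviation to its full name in one call.
# replace() returns a new Series, so we assign the result back to the column.
toy_df["state"] = toy_df["state"].replace({"GA": "Georgia", "MD": "Maryland"})
print(toy_df)
```

Values that do not appear in the mapping are left unchanged.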

### Filtering & Reindexing

Let's go back to our original data by reloading the file.

In [None]:
robocall_df = pd.read_csv("Data/Telemarketing_RoboCall_Weekly_Data_Transformed.csv")

You'll notice that a lot of the datasets you work with are deficient in some way or another. For instance, they may be missing values in some rows and columns. When it loads a file, Pandas is smart enough to mark empty fields as "NaN", which stands for Not a Number.

In [None]:
robocall_df["type_telemarketing"]

We can test for these values using the ``isnull`` and ``notnull`` functions which will return a True / False value based on the value of the item. 

In [None]:
robocall_df["type_telemarketing"].isnull()
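
Because `True` counts as 1 when summed, the same test can also tell us how many values are missing. The sketch below uses a made-up column rather than the real file, so the values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (values invented for illustration).
toy_df = pd.DataFrame(
    {"type_telemarketing": ["Robocall", np.nan, "Live caller", np.nan]}
)

# isnull() gives one True/False per row; summing counts the True values.
missing_mask = toy_df["type_telemarketing"].isnull()
n_missing = missing_mask.sum()
print(n_missing)
```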

And we may want to filter out those empty values. We can do that with a special selector syntax. In the following, notice that within the brackets we tell it to select rows for which type_telemarketing is not null. Another useful function for removing missing data is `dropna()`, which has [parameters](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.dropna.html) that allow you to drop rows or columns that have any or all values missing.

In [None]:
robocall_df[robocall_df["type_telemarketing"].notnull()]
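
As a rough illustration of how the `dropna()` parameters behave, the sketch below compares three calls on a tiny made-up DataFrame. The data is an assumption; only the column names come from the tutorial.

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 is partly missing, row 2 is entirely missing.
toy_df = pd.DataFrame({
    "type_telemarketing": ["Robocall", np.nan, np.nan],
    "state": ["Maryland", "Virginia", np.nan],
})

# Default (how="any"): drop a row if any of its values is missing.
any_missing_dropped = toy_df.dropna()

# how="all": drop a row only if every value in it is missing.
all_missing_dropped = toy_df.dropna(how="all")

# subset=[...]: only look at the listed columns when deciding what to drop.
subset_dropped = toy_df.dropna(subset=["type_telemarketing"])

print(len(any_missing_dropped), len(all_missing_dropped), len(subset_dropped))
```

The `subset` form gives the same rows as the `notnull()` selector above when it is pointed at the same column.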

If you need to filter by more than one column you can combine the tests using the ``&`` operator. Note, though, that you need an extra pair of parentheses around each logical test. Let's grab this dataframe and assign it to another variable.

In [None]:
maryland_df = robocall_df[(robocall_df["type_telemarketing"].notnull()) & (robocall_df["state"]=="Maryland")]
maryland_df

After all that filtering you might wonder how much data you have left. To check the shape (i.e. number of rows and columns) of a DataFrame just append ``.shape`` at the end. 

In [None]:
maryland_df.shape

You'll notice that in the filtered data frame the index starts from "617", but maybe we want to reset it to start at zero now that we're focused on Maryland. We can do that, but remember we have to assign the new dataframe back to the same name (i.e. `maryland_df`).

In [None]:
maryland_df = maryland_df.reset_index(drop=True)
maryland_df

### Accessing Rows

We may also sometimes need to access a row of data from a data frame. This can be done with the `iloc` accessor by providing the integer-based position in brackets, or with the `loc` accessor by providing the label-based index in brackets. In this example the two are equivalent, because we reset the index above and the labels now match the positions.

In [None]:
maryland_df.iloc[0]
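
The difference between `iloc` and `loc` only shows up when the index labels do not match the row positions, as in a filtered DataFrame before its index is reset. Here is a small sketch with an invented index (the labels 617, 618 and 620 are assumptions chosen for illustration):

```python
import pandas as pd

# Toy frame whose labels do not start at zero, like a filtered DataFrame.
toy_df = pd.DataFrame(
    {"state": ["Maryland", "Virginia", "Maryland"]},
    index=[617, 618, 620],
)

first_by_position = toy_df.iloc[0]   # first row, whatever its label is
first_by_label = toy_df.loc[617]     # the row whose label is 617

# toy_df.loc[0] would raise a KeyError here, because no row is labelled 0.

# After resetting the index, labels and positions line up again.
reset_df = toy_df.reset_index(drop=True)
print(reset_df.loc[0].equals(reset_df.iloc[0]))
```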

And if we need that row as an array we can use the `.values` attribute, which is sometimes necessary if we want to do other types of mathematical operations:

In [None]:
maryland_df.iloc[0].values
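
As a small sketch of why an array can be handy, the example below pulls one row out of a made-up, all-numeric DataFrame and sums it with NumPy. The column names are hypothetical and are not in the tutorial data. It uses `to_numpy()`, the method that newer Pandas versions recommend for this; `.values` returns the same array here.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns, invented for illustration.
toy_df = pd.DataFrame({"calls_week_1": [3, 5], "calls_week_2": [4, 1]})

# Convert the first row to a NumPy array and operate on it directly.
row_array = toy_df.iloc[0].to_numpy()
total = row_array.sum()
print(type(row_array), total)
```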

### Applying Data Transformations

Sometimes you will want to transform your data by applying a transformation function to each datum within a column or row. We don't necessarily need to, but to show you how to do it, let's make all the text in the `type_telemarketing` column lowercase. We define a function which takes in an input datum (x in this case) and returns the transformed value of that. We use the `apply` [function](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.apply.html) on the dataframe to apply that function to an entire column (or to an entire row).

In [None]:
def lowercaser(x):
    return x.lower()

maryland_df["type_telemarketing"] = maryland_df["type_telemarketing"].apply(lowercaser)
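
For simple text transformations like this one, Pandas also offers vectorized string methods through the `.str` accessor, which skip over missing values instead of failing on them. A minimal sketch on made-up data (the `type_lower` column name is an assumption introduced for the example):

```python
import numpy as np
import pandas as pd

# Toy column that still contains a missing value (values invented).
toy_df = pd.DataFrame(
    {"type_telemarketing": ["Robocall", "Live Caller", np.nan]}
)

# .str.lower() lowercases each string and leaves missing values missing.
# Calling x.lower() on NaN through apply() would raise an AttributeError,
# because NaN is a float and has no lower() method.
toy_df["type_lower"] = toy_df["type_telemarketing"].str.lower()
print(toy_df)
```

In the tutorial's `maryland_df` the missing values were already filtered out, which is why `apply(lowercaser)` works there.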

### Sorting 

Oftentimes you will want to sort your data to get an overview or see what is at the top or bottom of a ranking. To sort by values use the `sort_values` [function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values). Below we sort by the time_issued column from most recent to least recent. 

In [None]:
maryland_df.sort_values(by="time_issued", ascending=False)
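
One thing to watch for: if a time column was read in as plain text, it is sorted alphabetically rather than chronologically. Whether that matters depends on how the timestamps are formatted in the file, which we have not checked here. The sketch below, on invented timestamps, parses the column with `pd.to_datetime` before sorting. Note also that `sort_values` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Toy data with invented timestamps stored as text.
toy_df = pd.DataFrame({
    "state": ["Maryland", "Maryland", "Virginia"],
    "time_issued": ["2016-01-09 10:00", "2016-01-10 09:30", "2016-01-08 17:45"],
})

# Parse the text into real timestamps so the sort is chronological.
toy_df["time_issued"] = pd.to_datetime(toy_df["time_issued"])

# Most recent first; reset the index so it follows the new order.
sorted_df = (
    toy_df.sort_values(by="time_issued", ascending=False)
          .reset_index(drop=True)
)
print(sorted_df)
```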

### Aggregation 

You'll often want to summarize DataFrames to get an overview of your data, or to aggregate it. The `describe()` function is useful for an initial overview, and there are individual functions such as `min()`, `max()`, `sum()`, `mean()`, and [many others](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics).

In [None]:
maryland_df.describe()
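
The individual summary functions can be called on a single column as well. Here is a quick sketch using a hypothetical numeric column, `call_count`, which is not part of the tutorial data:

```python
import pandas as pd

# Hypothetical numeric column, invented for illustration.
toy_df = pd.DataFrame({"call_count": [2, 4, 9]})

smallest = toy_df["call_count"].min()
largest = toy_df["call_count"].max()
total = toy_df["call_count"].sum()
average = toy_df["call_count"].mean()
print(smallest, largest, total, average)
```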

A useful analytic operation is to create groups that can then be summarized. This can be accomplished with the `groupby()` [function](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html). Various [aggregation functions](http://pandas.pydata.org/pandas-docs/stable/groupby.html) can then be applied. 

In [None]:
state_groups = robocall_df.groupby("state")
state_groups.count()
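
Keep in mind that `count()` reports the number of non-missing values in each column per group, so the numbers can differ from column to column. If you want the number of rows in each group, `size()` is one option. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in a Maryland row (values invented).
toy_df = pd.DataFrame({
    "state": ["Maryland", "Virginia", "Maryland", "Maryland"],
    "type_telemarketing": ["Robocall", "Robocall", np.nan, "Live caller"],
})

# size() counts every row in the group, missing values included.
rows_per_state = toy_df.groupby("state").size()

# count() on a column counts only the non-missing values in that column.
non_null_per_state = toy_df.groupby("state")["type_telemarketing"].count()

print(rows_per_state)
print(non_null_per_state)
```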

### Deduplication

At times your data will have duplicate rows in it and you'll want to remove those. To check for duplicated rows you can use the `.duplicated()` [function](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.duplicated.html), and if you want to check for duplicates within a certain column you can pass that as a parameter. Let's say we want to detect duplicate caller id numbers: 

In [None]:
maryland_df.duplicated(["caller_id"])

We can then drop the rows detected as duplicates using the `drop_duplicates()` [function](http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html). 

In [None]:
maryland_df.drop_duplicates(["caller_id"]).shape
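
By default the first occurrence of each duplicated value is kept and the later ones are dropped. The `keep` parameter changes that. The sketch below uses a tiny made-up DataFrame (the caller ids and times are invented) to show the difference:

```python
import pandas as pd

# Toy data where caller "A" appears twice (values invented).
toy_df = pd.DataFrame({
    "caller_id": ["A", "B", "A", "C"],
    "time_issued": [1, 2, 3, 4],
})

# True marks rows whose caller_id has already appeared earlier.
dup_mask = toy_df.duplicated(["caller_id"])

# Default keep="first": keep the earliest row for each caller_id.
keep_first = toy_df.drop_duplicates(["caller_id"])

# keep="last": keep the latest row for each caller_id instead.
keep_last = toy_df.drop_duplicates(["caller_id"], keep="last")

print(keep_first)
print(keep_last)
```

Like the other functions in this tutorial, `drop_duplicates()` returns a new DataFrame, so assign the result to a name if you want to keep it.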