# Pandas Cheatsheet 2

## Exploration Methods

When you import data from an external source, you'll want to get a feel for it. Let's import some data and learn how to explore it.

In [1]:
import os
import pandas as pd
from datetime import datetime
import numpy as np

users = pd.read_csv(
    'https://raw.githubusercontent.com/treehouse-projects/python-introducing-pandas/master/data/users.csv',
    index_col=0,
)
transactions = pd.read_csv(
    'https://raw.githubusercontent.com/treehouse-projects/python-introducing-pandas/master/data/transactions.csv',
    index_col=0,
)
requests = pd.read_csv(
    'https://raw.githubusercontent.com/treehouse-projects/python-introducing-pandas/master/data/requests.csv',
    index_col=0,
)

In [2]:
len(users)

475

In [3]:
users.shape

(475, 7)

In [4]:
users.count()

first_name        475
last_name         430
email             475
email_verified    475
signup_date       475
referral_count    475
balance           475
dtype: int64
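
`count()` reports only non-missing values, which is why `last_name` shows 430 rather than 475 above. The following is a minimal sketch of that relationship; `demo_users` is a made-up miniature table, not the real `users` data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the users table; the values are illustrative only.
demo_users = pd.DataFrame(
    {
        "first_name": ["Aaron", "Brooke", "Sara"],
        "last_name": ["Davis", np.nan, "Boyer"],
    },
    index=["aaron", "brooke2027", "boyer7005"],
)

non_null = demo_users.count()      # non-missing values per column
missing = demo_users.isna().sum()  # missing values per column

print(non_null)
print(missing)
```

For every column, `non_null + missing` equals `len(demo_users)`.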

In [5]:
# value_counts() shows how often each value (here True/False) occurs in a column
users.email_verified.value_counts()

True     389
False     86
Name: email_verified, dtype: int64

In [6]:
users.dtypes

first_name         object
last_name          object
email              object
email_verified       bool
signup_date        object
referral_count      int64
balance           float64
dtype: object

In [7]:
users.describe()  # Each statistic is also available on its own via .mean(), .std(), .max(), etc.

Unnamed: 0,referral_count,balance
count,475.0,475.0
mean,3.429474,49.933263
std,2.281085,28.280448
min,0.0,0.05
25%,2.0,25.305
50%,3.0,51.57
75%,5.0,74.48
max,7.0,99.9


In [8]:
# value_counts() sorts by frequency in descending order by default, so head() shows the most common values
users.first_name.value_counts().head()

Mark        11
David       10
Michael      9
Jennifer     7
Joshua       7
Name: first_name, dtype: int64
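
As a small sketch of the same idea, `value_counts` can also return proportions instead of raw counts via `normalize=True`. The `names` Series below is invented for illustration.

```python
import pandas as pd

# Hypothetical sample of first names; not taken from the real dataset.
names = pd.Series(["Mark", "David", "Mark", "Jennifer", "Mark", "David"])

counts = names.value_counts()                # most frequent value first
shares = names.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```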

## Rearranging Data

You can create a new, sorted `DataFrame` by using the `sort_values` method. Let's sort the `DataFrame` so the user with the highest `balance` is at the top. Ascending order is the default, so we pass `ascending = False`.

In [9]:
users.sort_values(by = 'balance', ascending = False).head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
twhite,Timothy,White,white5136@hotmail.com,True,2018-07-06,5,99.9
karen.snow,Karen,Snow,ksnow@yahoo.com,True,2018-05-06,2,99.38
king,Billy,King,billy.king@hotmail.com,True,2018-05-29,4,98.8
king3246,Brittney,King,brittney@yahoo.com,True,2018-04-15,6,98.79
crane203,Valerie,Crane,valerie7051@hotmail.com,True,2018-05-12,3,98.69


This returns a new `DataFrame`, which can be assigned to a variable. If we want to reorder the existing `DataFrame` without reassigning it, we can use the `inplace` argument.

In [10]:
users.sort_values(by=['first_name', 'last_name'], inplace = True)
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14
adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,2018-06-16,7,25.85
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01
edwards8658,Alan,Edwards,alan8469@hotmail.com,False,2018-02-24,1,24.93


If we want the original order back, we can get it by sorting by the index (provided the data was originally ordered by its index).

In [11]:
users.sort_index(inplace=True)
users.head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14
acook,Anthony,Cook,cook@gmail.com,True,2018-05-12,2,55.45
adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,2018-06-16,7,25.85
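
The difference between returning a sorted copy and sorting in place can be sketched on a tiny, made-up frame (`demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical three-row table, deliberately not ordered by index.
demo = pd.DataFrame(
    {"balance": [18.14, 99.90, 55.45]},
    index=["aaron", "twhite", "acook"],
)

# Without inplace, sort_values returns a new DataFrame.
by_balance = demo.sort_values(by="balance", ascending=False)
order_of_copy = list(by_balance.index)   # highest balance first
order_of_original = list(demo.index)     # demo itself is untouched

# With inplace=True the call reorders demo itself and returns None.
demo.sort_index(inplace=True)
order_after_inplace = list(demo.index)

print(order_of_copy, order_of_original, order_after_inplace)
```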


## Selecting Data

A common need is to grab a subset of the data that meets certain criteria. You can do this by indexing a `DataFrame` with a boolean mask, much like with a NumPy `ndarray`.

CashBox uses a referral system: everyone you refer earns you £5. Let's see everyone who has __not__ taken advantage of that. The number of referrals a user has made can be found in the `referral_count` column.

In [12]:
no_referrals_index = users['referral_count'] < 1
no_referrals_index.head()

aaron            False
acook            False
adam.saunders    False
adrian           False
adrian.blair     False
Name: referral_count, dtype: bool

Using the boolean `Series` we just created, __`no_referrals_index`__, we can retrieve all rows where the comparison was `True`.

In [13]:
users[no_referrals_index].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
alan9443,Alan,Pope,pope@hotmail.com,True,2018-04-17,0,56.09
andrew.alvarez,Andrew,Alvarez,aalvarez@hotmail.com,False,2018-08-01,0,81.66
boyer7005,Sara,Boyer,boyer8636@gmail.com,True,2018-07-31,0,91.41
brandon.gilbert,Brandon,Gilbert,brandon.gilbert@hotmail.com,True,2018-04-28,0,10.17
brooke2027,Brooke,,brooke6938@gmail.com,False,2018-05-23,0,7.22


### Inversed Index

A handy shortcut is to prefix the boolean `Series` with `~`, the bitwise NOT operator. This returns the inverse of the `Series`.

In [14]:
~no_referrals_index.head()

aaron            True
acook            True
adam.saunders    True
adrian           True
adrian.blair     True
Name: referral_count, dtype: bool

In [15]:
# We can use this inverse to find where referral values DO NOT equal zero
users[~no_referrals_index].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
aaron,Aaron,Davis,aaron6348@gmail.com,True,2018-08-31,6,18.14
acook,Anthony,Cook,cook@gmail.com,True,2018-05-12,2,55.45
adam.saunders,Adam,Saunders,adam@gmail.com,False,2018-05-29,3,72.12
adrian,Adrian,Fang,adrian.fang@teamtreehouse.com,True,2018-04-28,3,30.01
adrian.blair,Adrian,Blair,adrian9335@gmail.com,True,2018-06-16,7,25.85
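
One detail worth knowing: a method call binds more tightly than `~`, so `~mask.head()` inverts the result of `head()` rather than taking the head of the inverted mask. For element-wise inversion the two give the same answer, as this sketch on invented data suggests:

```python
import pandas as pd

# Hypothetical referral counts; user names are illustrative.
referral_count = pd.Series(
    [6, 0, 3, 0],
    index=["aaron", "alan9443", "adrian", "boyer7005"],
)

no_referrals = referral_count < 1
has_referrals = ~no_referrals  # element-wise logical NOT

head_then_invert = ~no_referrals.head(2)    # parsed as ~(no_referrals.head(2))
invert_then_head = (~no_referrals).head(2)

print(referral_count[has_referrals])
```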


### In `loc`

A boolean `Series` may also be used as the row indexer in `DataFrame.loc`, optionally together with a list of columns to return.


In [16]:
users.loc[no_referrals_index, ['balance', 'email']].head()

Unnamed: 0,balance,email
alan9443,56.09,pope@hotmail.com
andrew.alvarez,81.66,aalvarez@hotmail.com
boyer7005,91.41,boyer8636@gmail.com
brandon.gilbert,10.17,brandon.gilbert@hotmail.com
brooke2027,7.22,brooke6938@gmail.com


It is also possible to do the comparison inline, without storing the boolean `Series` in a variable.

In [17]:
users[users['referral_count']==0].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
alan9443,Alan,Pope,pope@hotmail.com,True,2018-04-17,0,56.09
andrew.alvarez,Andrew,Alvarez,aalvarez@hotmail.com,False,2018-08-01,0,81.66
boyer7005,Sara,Boyer,boyer8636@gmail.com,True,2018-07-31,0,91.41
brandon.gilbert,Brandon,Gilbert,brandon.gilbert@hotmail.com,True,2018-04-28,0,10.17
brooke2027,Brooke,,brooke6938@gmail.com,False,2018-05-23,0,7.22


Just like with a NumPy `ndarray`, boolean `Series` can be combined with one another using the bitwise operators `&` (and), `|` (or) and `~` (not).

Don't forget to wrap each comparison in parentheses: the bitwise operators bind more tightly than comparison operators such as `==`.

In [18]:
# select all where they haven't made a referral and the email address hasn't been verified
users[(users['referral_count']==0) & (users['email_verified'] == False)].head()

Unnamed: 0,first_name,last_name,email,email_verified,signup_date,referral_count,balance
andrew.alvarez,Andrew,Alvarez,aalvarez@hotmail.com,False,2018-08-01,0,81.66
brooke2027,Brooke,,brooke6938@gmail.com,False,2018-05-23,0,7.22
dedwards,Deborah,Edwards,deborah@hotmail.com,False,2018-01-13,0,25.63
jacob.davis,Jacob,Davis,jdavis@gmail.com,False,2018-09-16,0,95.7
jason7792,Jason,Holland,jholland@hotmail.com,False,2018-01-18,0,22.37


## Manipulation Techniques

There are lots of ways to change the shape and data in the `DataFrame`

### Assigning Values

Let's assume the CashBox customer support team let us know that the user `Adrian Fang` has a balance of £30.01 when it should be £35.

In [19]:
# First make sure there is only one Adrian Fang
len(users[(users['first_name'] == 'Adrian') & (users['last_name'] == 'Fang')])

1

In [20]:
# Chained indexing like the line below assigns to a temporary copy, so users is not updated
# users[(users.first_name == "Adrian") & (users.last_name == "Fang")]['balance'] = 35.00
# The solution is to select the rows and the column in a single .loc call
users.loc[(users.first_name == 'Adrian') & (users.last_name == 'Fang'), 'balance'] = 35 
users.loc['adrian']

first_name                               Adrian
last_name                                  Fang
email             adrian.fang@teamtreehouse.com
email_verified                             True
signup_date                          2018-04-28
referral_count                                3
balance                                      35
Name: adrian, dtype: object
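
To make the difference concrete, here is a small sketch on invented data. Selecting with a boolean mask returns a copy, so the chained assignment never reaches the original frame. Depending on the pandas version the chained line may emit a warning, which this example silences deliberately.

```python
import warnings

import pandas as pd

# Hypothetical two-row table; values are illustrative.
demo = pd.DataFrame(
    {
        "first_name": ["Adrian", "Adrian"],
        "last_name": ["Fang", "Blair"],
        "balance": [30.01, 25.85],
    },
    index=["adrian", "adrian.blair"],
)
is_fang = (demo["first_name"] == "Adrian") & (demo["last_name"] == "Fang")

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # hide the version-dependent chained-assignment warning
    demo[is_fang]["balance"] = 35.0   # writes to a temporary copy
after_chained = demo.at["adrian", "balance"]

demo.loc[is_fang, "balance"] = 35.0   # one indexing call on demo itself
after_loc = demo.at["adrian", "balance"]

print(after_chained, after_loc)
```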

In [21]:
# .at is also a quick and easy way to set a single scalar value by row and column label
users.at['adrian', 'balance'] = 35

We now need to update the `transactions` table to reflect this. First we need to create a record to insert into it. Looking at the `transactions` table shows us which fields the record needs.

In [22]:
transactions.head()

Unnamed: 0,sender,receiver,amount,sent_date
0,stein,smoyer,49.03,2018-01-24
1,holden4580,joshua.henry,34.64,2018-02-06
2,rose.eaton,emily.lewis,62.67,2018-02-15
3,lmoore,kallen,1.94,2018-03-05
4,scott3928,lmoore,27.82,2018-03-10


In [23]:
record = dict(sender = np.nan, receiver = 'adrian', amount = 4.99, sent_date = datetime.now().date())


### Appending with DataFrame.append

The `DataFrame.append` method adds a new row to a dataset. This method doesn't change the original dataset but instead returns a copy of the `DataFrame` with the new row(s) appended. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions you need `pandas.concat` instead.

The index for our __`transactions`__ is auto-assigned, so we'll set the __`ignore_index`__ keyword argument to `True` so it gets generated.

In [24]:
# Remember this returns a copy; transactions itself is unchanged
transactions.append(record, ignore_index = True).tail()
# For multiple rows, the more efficient way is the pd.concat function

Unnamed: 0,sender,receiver,amount,sent_date
994,king3246,john,25.37,2018-09-25
995,shernandez,kristen1581,75.77,2018-09-25
996,leah6255,jholloway,63.62,2018-09-25
997,pamela,michelle4225,2.54,2018-09-25
998,,adrian,4.99,2018-11-20
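
On pandas versions where `DataFrame.append` is unavailable, the same result can be sketched with `pd.concat`. The frames below are made-up stand-ins for `transactions` and for the new rows; the second new row exists only to show several rows being added at once.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the transactions table.
demo_tx = pd.DataFrame(
    {
        "sender": ["stein", "lmoore"],
        "receiver": ["smoyer", "kallen"],
        "amount": [49.03, 1.94],
        "sent_date": ["2018-01-24", "2018-03-05"],
    }
)

# New rows as a list of dicts; the second row is purely illustrative.
new_rows = pd.DataFrame(
    [
        {"sender": np.nan, "receiver": "adrian", "amount": 4.99, "sent_date": "2018-11-20"},
        {"sender": "stein", "receiver": "kallen", "amount": 10.00, "sent_date": "2018-11-20"},
    ]
)

# concat returns a new DataFrame; ignore_index=True regenerates 0..n-1 labels.
extended = pd.concat([demo_tx, new_rows], ignore_index=True)

print(extended.tail())
```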


### Adding columns

You can add columns in a similar way to rows. Any values you don't supply are set to `np.nan`.

#### Setting with Enlargement


In [25]:
latest_id = transactions.index.max()
transactions.at[latest_id, 'notes'] = 'Adrian phoned up to change the order'
transactions.tail()

Unnamed: 0,sender,receiver,amount,sent_date,notes
993,coleman,sarah.evans,36.29,2018-09-25,
994,king3246,john,25.37,2018-09-25,
995,shernandez,kristen1581,75.77,2018-09-25,
996,leah6255,jholloway,63.62,2018-09-25,
997,pamela,michelle4225,2.54,2018-09-25,Adrian phoned up to change the order


In [26]:
# Columns can be added and assigned from an expression
transactions['Big Orders'] = transactions.amount > 70
transactions.tail()

Unnamed: 0,sender,receiver,amount,sent_date,notes,Big Orders
993,coleman,sarah.evans,36.29,2018-09-25,,False
994,king3246,john,25.37,2018-09-25,,False
995,shernandez,kristen1581,75.77,2018-09-25,,True
996,leah6255,jholloway,63.62,2018-09-25,,False
997,pamela,michelle4225,2.54,2018-09-25,Adrian phoned up to change the order,False
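
Both ways of adding a column can be sketched on a tiny, invented frame. Assigning to a label that does not exist yet creates the column and leaves the other rows missing, while assigning an expression fills every row. `demo_tx` and the note text are assumptions for illustration.

```python
import pandas as pd

# Hypothetical two-row slice of a transactions table.
demo_tx = pd.DataFrame({"amount": [75.77, 2.54]}, index=[996, 997])

# Setting with enlargement: "notes" does not exist yet, so it is created.
demo_tx.loc[997, "notes"] = "example note"

# A whole column derived from an expression on an existing column.
demo_tx["big_orders"] = demo_tx["amount"] > 70

print(demo_tx)
```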


### Renaming columns 

Renaming columns can be achieved using the `DataFrame.rename()` method. You specify the current name(s) as the key(s) and the new name(s) as the value(s).

By default this returns a copy, but you can use the `inplace` parameter to apply the change to the existing `DataFrame`.

In [27]:
transactions.rename(columns = {'Big Orders':'big_orders'}, inplace = True)
transactions.head()

Unnamed: 0,sender,receiver,amount,sent_date,notes,big_orders
0,stein,smoyer,49.03,2018-01-24,,False
1,holden4580,joshua.henry,34.64,2018-02-06,,False
2,rose.eaton,emily.lewis,62.67,2018-02-15,,False
3,lmoore,kallen,1.94,2018-03-05,,False
4,scott3928,lmoore,27.82,2018-03-10,,False
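
As a short sketch on made-up data, `rename` accepts either a mapping or a function for `columns`, and without `inplace` it leaves the original untouched:

```python
import pandas as pd

# Hypothetical frame with one awkwardly named column.
demo = pd.DataFrame({"Big Orders": [False, True], "amount": [36.29, 75.77]})

# Mapping of old name -> new name; returns a renamed copy.
renamed = demo.rename(columns={"Big Orders": "big_orders"})

# A function is applied to every column name instead.
snake_cased = demo.rename(columns=lambda name: name.lower().replace(" ", "_"))

print(list(renamed.columns), list(demo.columns))
```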


### Deleting Columns

In addition to slicing a `DataFrame` to simply not include a specific existing column, you can also drop columns by name. Let's remove the two columns we added, in place.

In [28]:
transactions.drop(columns = ['notes', 'big_orders'], inplace = True)
transactions.head()

Unnamed: 0,sender,receiver,amount,sent_date
0,stein,smoyer,49.03,2018-01-24
1,holden4580,joshua.henry,34.64,2018-02-06
2,rose.eaton,emily.lewis,62.67,2018-02-15
3,lmoore,kallen,1.94,2018-03-05
4,scott3928,lmoore,27.82,2018-03-10
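
The two approaches mentioned above, selecting the columns to keep versus dropping columns by name, can be compared on a small invented frame. The `errors="ignore"` argument shown last skips names that are not present instead of raising.

```python
import pandas as pd

# Hypothetical one-row table with two extra columns.
demo = pd.DataFrame({"amount": [49.03], "notes": ["x"], "big_orders": [False]})

selected = demo[["amount"]]                           # keep only the listed columns
trimmed = demo.drop(columns=["notes", "big_orders"])  # remove the listed columns (copy)

# "not_a_column" is deliberately missing; errors="ignore" skips it silently.
safe = demo.drop(columns=["notes", "not_a_column"], errors="ignore")

print(list(trimmed.columns), list(safe.columns))
```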


## Combining DataFrames

CashBox has provided us with several separate CSV files. Let's take a look at the two files `transactions.csv` and `requests.csv`. Requests are made in the application when one user requests cash from another. Requests are not required for a transaction to occur.

Let's see how many requests and payments have been made. To do that we'll have to combine the two datasets.

In [29]:
transactions.shape, requests.shape

((998, 4), (313, 4))

In [30]:
transactions.head()

Unnamed: 0,sender,receiver,amount,sent_date
0,stein,smoyer,49.03,2018-01-24
1,holden4580,joshua.henry,34.64,2018-02-06
2,rose.eaton,emily.lewis,62.67,2018-02-15
3,lmoore,kallen,1.94,2018-03-05
4,scott3928,lmoore,27.82,2018-03-10


In [31]:
requests.head()

Unnamed: 0,from_user,to_user,amount,request_date
0,chad.chen,paula7980,78.61,2018-02-12
1,kallen,lmoore,1.94,2018-02-23
2,gregory.blackwell,rodriguez5768,30.57,2018-03-04
3,kristina.miller,john.hardy,77.05,2018-03-12
4,lacey8987,mcguire,54.09,2018-03-13


I'd like to see all the requests that have a matching transaction based on the users and the amount involved.

In order to do this we will merge the datasets together.

We'll create a new dataset by using the `DataFrame.merge()` method

In [32]:
succesful_requests = requests.merge(
    transactions,
    left_on = ['from_user', 'to_user', 'amount'],
    right_on = ['receiver', 'sender', 'amount']
)
succesful_requests.head()

Unnamed: 0,from_user,to_user,amount,request_date,sender,receiver,sent_date
0,chad.chen,paula7980,78.61,2018-02-12,paula7980,chad.chen,2018-07-15
1,kallen,lmoore,1.94,2018-02-23,lmoore,kallen,2018-03-05
2,gregory.blackwell,rodriguez5768,30.57,2018-03-04,rodriguez5768,gregory.blackwell,2018-03-17
3,kristina.miller,john.hardy,77.05,2018-03-12,john.hardy,kristina.miller,2018-04-25
4,lacey8987,mcguire,54.09,2018-03-13,mcguire,lacey8987,2018-06-28
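
`merge` performs an inner join by default, so requests without a matching transaction disappear from the result. A minimal sketch on invented data shows this, along with `how="left"` and `indicator=True` to keep and flag the unmatched requests. The two demo frames are assumptions, not the real files.

```python
import pandas as pd

# Hypothetical requests: only the first one was ever paid.
requests_demo = pd.DataFrame(
    {
        "from_user": ["kallen", "chad.chen"],
        "to_user": ["lmoore", "paula7980"],
        "amount": [1.94, 78.61],
    }
)
# Hypothetical transactions: the second one had no request.
transactions_demo = pd.DataFrame(
    {
        "sender": ["lmoore", "stein"],
        "receiver": ["kallen", "smoyer"],
        "amount": [1.94, 49.03],
    }
)

keys = dict(
    left_on=["from_user", "to_user", "amount"],
    right_on=["receiver", "sender", "amount"],
)

matched = requests_demo.merge(transactions_demo, **keys)  # inner join
flagged = requests_demo.merge(transactions_demo, how="left", indicator=True, **keys)

print(matched)
print(flagged[["from_user", "to_user", "_merge"]])
```

The `_merge` column reads `both` for fulfilled requests and `left_only` for requests with no matching transaction.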


From this we can work out the time between requesting a payment and receiving the payment.

The first thing we need to do is convert the date columns to the right type, `datetime`. This can be done with the `pandas.to_datetime` function.

In [33]:
succesful_requests['request_date'] = pd.to_datetime(succesful_requests['request_date'])
succesful_requests['sent_date'] = pd.to_datetime(succesful_requests['sent_date'])
succesful_requests.dtypes

from_user               object
to_user                 object
amount                 float64
request_date    datetime64[ns]
sender                  object
receiver                object
sent_date       datetime64[ns]
dtype: object

From this we can subtract the dates (via vectorisation) to find out the timedelta. Then from this we can look at the highest timedelta

In [34]:
succesful_requests['timedelta'] = succesful_requests.sent_date - succesful_requests.request_date
succesful_requests.sort_values(by = 'timedelta', ascending = False).head()

Unnamed: 0,from_user,to_user,amount,request_date,sender,receiver,sent_date,timedelta
0,chad.chen,paula7980,78.61,2018-02-12,paula7980,chad.chen,2018-07-15,153 days
33,sthompson,andrade,14.07,2018-05-09,andrade,sthompson,2018-09-21,135 days
4,lacey8987,mcguire,54.09,2018-03-13,mcguire,lacey8987,2018-06-28,107 days
53,marcus.berry,melissa.mendoza,71.48,2018-05-31,melissa.mendoza,marcus.berry,2018-09-06,98 days
39,bishop,massey2102,18.27,2018-05-16,massey2102,bishop,2018-08-15,91 days
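
The conversion and subtraction steps can be sketched end to end on two rows whose dates are copied from the output above. The `days` column is an extra added here for illustration, using the `.dt.days` accessor to turn each timedelta into a whole number of days.

```python
import pandas as pd

# Two request/payment date pairs taken from the merged output shown earlier.
demo = pd.DataFrame(
    {
        "request_date": ["2018-02-12", "2018-02-23"],
        "sent_date": ["2018-07-15", "2018-03-05"],
    }
)

# Parse the strings into datetime64 values.
demo["request_date"] = pd.to_datetime(demo["request_date"])
demo["sent_date"] = pd.to_datetime(demo["sent_date"])

# Vectorised subtraction gives one timedelta per row.
demo["timedelta"] = demo["sent_date"] - demo["request_date"]
demo["days"] = demo["timedelta"].dt.days

print(demo)
```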


This `DataFrame` can also tell us the total value of money transferred through CashBox in these request-matched transactions.

In [35]:
"Wow, £{:.2f} has passed through CashBox over {} transactions".format(
    succesful_requests.amount.sum(),
    len(succesful_requests)
)

'Wow, £10496.47 has passed through CashBox over 214 transactions'