# Querying your data

Once your data is read in and available as a DataFrame, Pandas provides a whole suite of tools for extracting information from it.

Let's start by looking at some example data which contains information about the amounts that people at a restaurant paid and tipped for their meals:

In [2]:
import pandas as pd

tips = pd.read_csv("./data/tips.csv")
tips

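|     | total_bill | tip  | day | time   | size |
|-----|------------|------|-----|--------|------|
| 0   | 16.99      | 0.71 | Sun | Dinner | 2    |
| 1   | 10.34      | 1.16 | Sun | Dinner | 3    |
| 2   | 21.01      | 2.45 | Sun | Dinner | 3    |
| 3   | 23.68      | 2.32 | Sun | Dinner | 2    |
| 4   | 24.59      | 2.53 | Sun | Dinner | 4    |
| ... | ...        | ...  | ... | ...    | ...  |
| 239 | 29.03      | 4.14 | Sat | Dinner | 3    |
| 240 | 27.18      | 1.40 | Sat | Dinner | 2    |
| 241 | 22.67      | 1.40 | Sat | Dinner | 2    |
| 242 | 17.82      | 1.22 | Sat | Dinner | 2    |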


The first thing that you usually want to be able to do is to pull apart the overall table to get at specific bits of data from inside.

When using `list`s and `dict`s in Python, the square-bracket syntax was used to fetch an item from the container. In Pandas we can use the same syntax but it's a much more powerful tool.

If you pass a single string to the square brackets of a `DataFrame` it will return to you just that one column:

In [3]:
tips["total_bill"]

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

Accessing a column like this returns an object called [a `Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), which is the second of the two main Pandas data types. Don't worry too much about these just yet, but think of a `Series` as being a single column of the `DataFrame`, along with the index of the `DataFrame`. If you only need the underlying values without the index, the `.values` attribute gives them back as an array:

In [4]:
tips['total_bill'].values # convert to array

array([16.99, 10.34, 21.01, 23.68, 24.59, 25.29,  8.77, 26.88, 15.04,
       14.78, 10.27, 35.26, 15.42, 18.43, 14.83, 21.58, 10.33, 16.29,
       16.97, 20.65, 17.92, 20.29, 15.77, 39.42, 19.82, 17.81, 13.37,
       12.69, 21.7 , 19.65,  9.55, 18.35, 15.06, 20.69, 17.78, 24.06,
       16.31, 16.93, 18.69, 31.27, 16.04, 17.46, 13.94,  9.68, 30.4 ,
       18.29, 22.23, 32.4 , 28.55, 18.04, 12.54, 10.29, 34.81,  9.94,
       25.56, 19.49, 38.01, 26.41, 11.24, 48.27, 20.29, 13.81, 11.02,
       18.29, 17.59, 20.08, 16.45,  3.07, 20.23, 15.01, 12.02, 17.07,
       26.86, 25.28, 14.73, 10.51, 17.92, 27.2 , 22.76, 17.29, 19.44,
       16.66, 10.07, 32.68, 15.98, 34.83, 13.03, 18.28, 24.71, 21.16,
       28.97, 22.49,  5.75, 16.32, 22.75, 40.17, 27.28, 12.03, 21.01,
       12.46, 11.35, 15.38, 44.3 , 22.42, 20.92, 15.36, 20.49, 25.21,
       18.24, 14.31, 14.  ,  7.25, 38.07, 23.95, 25.71, 17.31, 29.93,
       10.65, 12.43, 24.08, 11.69, 13.42, 14.26, 15.95, 12.48, 29.8 ,
        8.52, 14.52,

If you pass a list of column names to the square brackets then you can grab out just those columns:

In [5]:
tips[["total_bill", "tip"]]

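|     | total_bill | tip  |
|-----|------------|------|
| 0   | 16.99      | 0.71 |
| 1   | 10.34      | 1.16 |
| 2   | 21.01      | 2.45 |
| 3   | 23.68      | 2.32 |
| 4   | 24.59      | 2.53 |
| ... | ...        | ...  |
| 239 | 29.03      | 4.14 |
| 240 | 27.18      | 1.40 |
| 241 | 22.67      | 1.40 |
| 242 | 17.82      | 1.22 |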


In this case it gives you back another DataFrame, just with only the required columns present.
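To make the difference between the two forms of square-bracket selection concrete, here is a minimal sketch. It assumes a small hypothetical `sample` DataFrame built by hand from the first three rows shown above, rather than the full `tips.csv` file:

```python
import pandas as pd

# Hypothetical three-row sample copied from the top of the table above
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [0.71, 1.16, 2.45],
    "size": [2, 3, 3],
})

one_column = sample["tip"]                   # a single string gives a Series
as_frame = sample[["tip"]]                   # a one-item list gives a DataFrame
two_columns = sample[["total_bill", "tip"]]  # a longer list gives a wider DataFrame

print(type(one_column).__name__)  # Series
print(type(as_frame).__name__)    # DataFrame
print(as_frame.shape)             # (3, 1)
print(list(two_columns.columns))
```

Note that even a list containing a single name returns a `DataFrame`, so the type of the result depends on what you pass in, not on how many columns come back.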

## Getting rows

If you want to select a *row* from a `DataFrame` then you can use the `.loc` (short for "location") attribute which allows you to pass index values like:

In [6]:
tips.loc[2]

total_bill     21.01
tip             2.45
day              Sun
time          Dinner
size               3
Name: 2, dtype: object

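The related `.iloc` attribute selects by integer *position* rather than by label. Asking for row position 2 and column position 0 gives the `total_bill` value from that row:

In [7]:
tips.iloc[2, 0]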

21.01

If you want to grab a single value from the table, you can follow the row label with the column name that you want:

In [8]:
tips.loc[2, "total_bill"]

21.01
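In the `tips` table the row labels happen to be the same as the row positions, which can hide the difference between selecting by label and selecting by position. The sketch below uses a hypothetical `sample` DataFrame whose index labels (`10`, `20`, `30`) were chosen purely for illustration so that the two no longer coincide:

```python
import pandas as pd

sample = pd.DataFrame(
    {"total_bill": [16.99, 10.34, 21.01], "tip": [0.71, 1.16, 2.45]},
    index=[10, 20, 30],  # hypothetical labels, deliberately different from positions
)

by_label = sample.loc[20, "tip"]                # the row *labelled* 20
by_position = sample.iloc[1, 1]                 # second row, second column
several = sample.loc[[10, 20], "total_bill"]    # a list of labels gives a Series

print(by_label, by_position)
print(several.tolist())
```

Here `sample.loc[1]` would raise a `KeyError` because there is no row with the label `1`, even though there is a row at position 1.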

In [9]:
tips.head()

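|     | total_bill | tip  | day | time   | size |
|-----|------------|------|-----|--------|------|
| 0   | 16.99      | 0.71 | Sun | Dinner | 2    |
| 1   | 10.34      | 1.16 | Sun | Dinner | 3    |
| 2   | 21.01      | 2.45 | Sun | Dinner | 3    |
| 3   | 23.68      | 2.32 | Sun | Dinner | 2    |
| 4   | 24.59      | 2.53 | Sun | Dinner | 4    |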


### Exercise 1

The `size` column in the data is the number of people in the dining party. 

Extract this column from the DataFrame.



In [10]:
tips["size"]

0      2
1      3
2      3
3      2
4      4
      ..
239    3
240    2
241    2
242    2
243    2
Name: size, Length: 244, dtype: int64

## Descriptive statistics

Now that we know how to refer to individual columns, we can start asking questions about the data therein. If you've worked with columns of data in Excel for example, you've probably come across the `SUM()` and `AVERAGE()` functions to summarise data. We can do the same thing in pandas by calling the `sum()` or `mean()` methods on a column:

In [11]:
tips["total_bill"].sum()

4827.77

In [12]:
tips["total_bill"].mean()

19.78594262295082

You can see a list of all the possible functions you can call in the [documentation for `Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). So for example, you can also ask for the maximum value from a column with the `max()` method. 

In [13]:
tips["tip"].max()

7.0
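As a small self-contained illustration of these summary methods, the sketch below builds a hypothetical `sample_bills` Series from the first five bills in the table. The `describe()` call is an extra that the text above does not use; it is included on the assumption that seeing several statistics at once is helpful:

```python
import pandas as pd

# Hypothetical sample: the first five total_bill values from the table
sample_bills = pd.Series([16.99, 10.34, 21.01, 23.68, 24.59], name="total_bill")

total = sample_bills.sum()      # add up every value
average = sample_bills.mean()   # arithmetic mean
largest = sample_bills.max()
smallest = sample_bills.min()

# describe() returns a Series of count, mean, std, min, quartiles and max
summary = sample_bills.describe()

print(round(total, 2), round(average, 3), largest, smallest)
print(summary)
```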

In some situations, you don't just want to get the *value* of the maximum, but rather to find out *which row* it came from. In cases like that there is the `idxmax()` method which gives you the *index label* of the row with the maximum:

In [14]:
tips["total_bill"].idxmax()

170

So we know that the largest total bill was found in the row with the label `170`.

You can then use this information with the `.loc` attribute to get the rest of the information for that row:

In [15]:
index_of_max_bill = tips["total_bill"].idxmax()
tips.loc[index_of_max_bill]

total_bill     50.81
tip              7.0
day              Sat
time          Dinner
size               3
Name: 170, dtype: object
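The same pattern works with any kind of index label. The following sketch assumes a hypothetical `sample` DataFrame with letter labels (not part of the real data) to emphasise that `idxmax()` returns a *label*, which is what `.loc` expects:

```python
import pandas as pd

sample = pd.DataFrame(
    {"total_bill": [16.99, 10.34, 24.59, 21.01], "tip": [0.71, 1.16, 2.53, 2.45]},
    index=["a", "b", "c", "d"],  # hypothetical labels for illustration
)

label_of_max_bill = sample["total_bill"].idxmax()       # label of the largest bill
row_of_max_bill = sample.loc[label_of_max_bill]         # the whole row, as a Series
tip_on_max_bill = sample.loc[label_of_max_bill, "tip"]  # one value from that row

print(label_of_max_bill)
print(tip_on_max_bill)
```

If the maximum occurs in more than one row, `idxmax()` returns the label of the first occurrence.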

### Exercise 2

Find the value of the tip that was paid for the smallest total bill.

Hint: Have a look at the [documentation page for `Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). There's a function which works like `idxmax()` but finds the minimum.



In [16]:
index_of_smallest_bill = tips["total_bill"].idxmin()
tips.loc[index_of_smallest_bill, "tip"]

0.7

## Acting on columns

Functions like `sum()` and `max()` summarise down the column to a single value. In some situations we instead want to manipulate a column to create a new column.

For example, the data in the table is in British pounds. If we wanted to convert it into pennies, we would need to multiply each value by 100. In Pandas you can refer to an entire column and perform a mathematical operation on it, and the operation will be applied to each row:

In [17]:
tips["total_bill"] * 100  # pounds to pennies (to convert to dollars instead, multiply by 1.29)

0      1699.0
1      1034.0
2      2101.0
3      2368.0
4      2459.0
        ...  
239    2903.0
240    2718.0
241    2267.0
242    1782.0
243    1878.0
Name: total_bill, Length: 244, dtype: float64

The data in row `0` was previously 16.99 but the result here is 1699.0, and likewise for every other row.

You can do any mathematical operation that Python supports, such as `+`, `-` and `/`.
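A short sketch of a few of these operators follows, using a hypothetical `bills` Series made from the first three bills in the table. The flat charge of 1.50 is an invented number, used only to show addition:

```python
import pandas as pd

# Hypothetical sample: the first three total_bill values
bills = pd.Series([16.99, 10.34, 21.01])

in_pennies = bills * 100    # multiply every value by 100
with_charge = bills + 1.50  # add an invented flat charge to every value
halved = bills / 2          # divide every value by 2
squared = bills ** 2        # raise every value to a power

# Each operation returns a new Series; the original is left unchanged
print(in_pennies.round(2).tolist())
print(bills.tolist())
```

Because each operation returns a new `Series`, you need to store the result in a variable (or assign it back to a column) if you want to keep it.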

## Combining columns


As well as operating on individual columns, you can combine multiple columns. Any operation between two columns is done *row-wise*: adding two columns adds together the two values from the first row of each, then the two values from the second row of each, and so on.

For example, if we wanted to find the ratio between the tip amount and the total bill for each entry in our table, we could divide one column by the other:

In [18]:
tips["tip"] / tips["total_bill"]

0      0.041789
1      0.112186
2      0.116611
3      0.097973
4      0.102887
         ...   
239    0.142611
240    0.051508
241    0.061756
242    0.068462
243    0.111821
Length: 244, dtype: float64

Of course, if we want the tip *percentage*, we need to multiply the value by 100:

In [19]:
(tips["tip"] / tips["total_bill"])*100

0       4.178929
1      11.218569
2      11.661114
3       9.797297
4      10.288735
         ...    
239    14.261109
240     5.150846
241     6.175562
242     6.846240
243    11.182109
Length: 244, dtype: float64

It can get messy and hard to read if you do too many things on one line, so it's a good idea to split each part of your calculation onto its own line, giving each step its own variable name along the way:

In [20]:
tip_fraction = tips["tip"] / tips["total_bill"]
tip_percent = tip_fraction*100
tip_percent

0       4.178929
1      11.218569
2      11.661114
3       9.797297
4      10.288735
         ...    
239    14.261109
240     5.150846
241     6.175562
242     6.846240
243    11.182109
Length: 244, dtype: float64
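One detail behind the row-wise behaviour is worth seeing once: when two `Series` are combined, pandas matches values up by *index label*, not by their order. Columns taken from the same `DataFrame` share an index, so this is invisible in the examples above. The sketch below uses two hypothetical hand-built `Series` with deliberately mismatched indexes:

```python
import pandas as pd

# Hypothetical data: tips has an extra label (3) and bills is in a shuffled order
tips_paid = pd.Series([0.71, 1.16, 2.45, 2.32], index=[0, 1, 2, 3])
bills = pd.Series([21.01, 16.99, 10.34], index=[2, 0, 1])

# Values are paired by label: the tip labelled 0 is divided by the bill labelled 0
fraction = tips_paid / bills

print(fraction.loc[0])  # 0.71 / 16.99
print(fraction.loc[3])  # no bill has the label 3, so the result is NaN
```

A label that appears in only one of the two `Series` produces a missing value (`NaN`) in the result rather than an error.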

### Exercise 3

The `total_bill` column gives the total amount for the entire dining party. Calculate the amount spent *per person* for each row in the DataFrame.

Extra: calculate the average and the standard deviation of this data. You might need to take a look at [the documentation page for the `Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) type.



In [21]:
bill_per_person = tips["total_bill"] / tips["size"]
bill_per_person

0       8.495000
1       3.446667
2       7.003333
3      11.840000
4       6.147500
         ...    
239     9.676667
240    13.590000
241    11.335000
242     8.910000
243     9.390000
Length: 244, dtype: float64

In [22]:
bill_per_person.mean()

7.888229508196722

In [23]:
bill_per_person.std()

2.9143496626220995
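It can help to know which standard deviation this is. By default, `Series.std()` computes the *sample* standard deviation (dividing by `n - 1`), and the `ddof` parameter lets you ask for the population version instead. The sketch below checks this by hand on a hypothetical three-value `per_person` Series taken from rows `0`, `3` and `4` of the result above:

```python
import pandas as pd

# Hypothetical sample: bill-per-person values from rows 0, 3 and 4
per_person = pd.Series([8.495, 11.84, 6.1475])

sample_std = per_person.std()            # default ddof=1: divide by n - 1
population_std = per_person.std(ddof=0)  # divide by n instead

# The same sample standard deviation computed step by step
n = len(per_person)
squared_deviations = (per_person - per_person.mean()) ** 2
manual_std = (squared_deviations.sum() / (n - 1)) ** 0.5

print(sample_std, population_std, manual_std)
```

With 244 rows the two versions differ only slightly, but on small samples the gap is noticeable.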

## Adding new columns

New columns can be added to a `DataFrame` by assigning them by index (as you would for a Python `dict`):

In [24]:
tips["percent_tip"] = (tips["tip"] / tips["total_bill"])*100
tips

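|     | total_bill | tip  | day | time   | size | percent_tip |
|-----|------------|------|-----|--------|------|-------------|
| 0   | 16.99      | 0.71 | Sun | Dinner | 2    | 4.178929    |
| 1   | 10.34      | 1.16 | Sun | Dinner | 3    | 11.218569   |
| 2   | 21.01      | 2.45 | Sun | Dinner | 3    | 11.661114   |
| 3   | 23.68      | 2.32 | Sun | Dinner | 2    | 9.797297    |
| 4   | 24.59      | 2.53 | Sun | Dinner | 4    | 10.288735   |
| ... | ...        | ...  | ... | ...    | ...  | ...         |
| 239 | 29.03      | 4.14 | Sat | Dinner | 3    | 14.261109   |
| 240 | 27.18      | 1.40 | Sat | Dinner | 2    | 5.150846    |
| 241 | 22.67      | 1.40 | Sat | Dinner | 2    | 6.175562    |
| 242 | 17.82      | 1.22 | Sat | Dinner | 2    | 6.846240    |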
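Assigning by index changes the `DataFrame` in place. If you would rather leave the original untouched, one alternative (not used elsewhere in this section) is the `assign()` method, which returns a new `DataFrame` with the extra column. The sketch below shows both on a hypothetical two-row `sample`; the `tip_fraction` column name is invented for the example:

```python
import pandas as pd

# Hypothetical two-row sample from the top of the table
sample = pd.DataFrame({"total_bill": [16.99, 10.34], "tip": [0.71, 1.16]})

# assign() returns a new DataFrame and leaves `sample` as it was
with_percent = sample.assign(percent_tip=sample["tip"] / sample["total_bill"] * 100)
print(list(sample.columns))        # ['total_bill', 'tip']
print(list(with_percent.columns))  # ['total_bill', 'tip', 'percent_tip']

# Assigning by index adds the column to `sample` itself
sample["tip_fraction"] = sample["tip"] / sample["total_bill"]
print(list(sample.columns))        # ['total_bill', 'tip', 'tip_fraction']
```

Assigning to a column name that already exists overwrites that column, so it is worth checking the name first.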


### Exercise 4

Take the "bill per person" result you calculated in the last exercise and add it as a new column, `bill_per_person`, in the DataFrame.



In [25]:
tips["bill_per_person"] = tips["total_bill"] / tips["size"]
tips

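|     | total_bill | tip  | day | time   | size | percent_tip | bill_per_person |
|-----|------------|------|-----|--------|------|-------------|-----------------|
| 0   | 16.99      | 0.71 | Sun | Dinner | 2    | 4.178929    | 8.495000        |
| 1   | 10.34      | 1.16 | Sun | Dinner | 3    | 11.218569   | 3.446667        |
| 2   | 21.01      | 2.45 | Sun | Dinner | 3    | 11.661114   | 7.003333        |
| 3   | 23.68      | 2.32 | Sun | Dinner | 2    | 9.797297    | 11.840000       |
| 4   | 24.59      | 2.53 | Sun | Dinner | 4    | 10.288735   | 6.147500        |
| ... | ...        | ...  | ... | ...    | ...  | ...         | ...             |
| 239 | 29.03      | 4.14 | Sat | Dinner | 3    | 14.261109   | 9.676667        |
| 240 | 27.18      | 1.40 | Sat | Dinner | 2    | 5.150846    | 13.590000       |
| 241 | 22.67      | 1.40 | Sat | Dinner | 2    | 6.175562    | 11.335000       |
| 242 | 17.82      | 1.22 | Sat | Dinner | 2    | 6.846240    | 8.910000        |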
