# 01-Beginner: Pandas essentials
For these exercises, we are going to be using the [Pandas](https://pandas.pydata.org/about/) Python package, so the first thing to do is to import Pandas. To do this, click on the code block below and press the 'Run' button to the left.

In [2]:
import pandas as pd

## Creating data
Pandas uses the `DataFrame` class to store and manipulate tabular data. Run the code below to create a `DataFrame` using some simple test data. Here we have used the `head()` method to show (part of) the `DataFrame`.

In [3]:
df_test = pd.DataFrame(data=[[1, 2, 3.0], [4, 5, 6], [0.07, 88.0, 999]], columns=['A', 'B', 'C'])
df_test.head()

| | A | B | C |
|---|---|---|---|
| 0 | 1.0 | 2.0 | 3.0 |
| 1 | 4.0 | 5.0 | 6.0 |
| 2 | 0.07 | 88.0 | 999.0 |
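
A `DataFrame` can also be built from a dictionary that maps column names to lists of values, which can be easier to read than a list of rows. The following is a minimal sketch using a made-up `df_example` (this name is an assumption and is not used in the exercises); it also inspects the column types and the overall size.

```python
import pandas as pd

# Hypothetical example data: each dictionary key becomes a column name
df_example = pd.DataFrame({
    'A': [1, 4, 0.07],
    'B': [2, 5, 88.0],
    'C': [3.0, 6, 999],
})

# dtypes reports the type Pandas chose for each column
print(df_example.dtypes)

# shape gives (number of rows, number of columns)
print(df_example.shape)
```

Because each column mixes integers and decimals, Pandas stores every column as floating-point numbers.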


### Exercise-01: Heads and tails
Try running the above code with `head()` replaced by `head(1)`, `head(2)`, and `head(3)`, respectively. Similarly, try running the above code with `head()` replaced by `tail(1)`, `tail(2)`, and `tail(3)`, respectively. What do `head()` and `tail()` do?

## Saving data
Often it is useful to save data. For example, we can save the previous test data to a comma-separated values (`.csv`) file by running the following code. After you have run the code, identify *where* the file was saved (i.e. which folder?).

In [4]:
df_test.to_csv('./test.csv')
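
By default, `to_csv` also writes the row labels (the index) as an unnamed first column, which shows up as `Unnamed: 0` if the file is read back in. The sketch below illustrates the difference that `index=False` makes. It uses a made-up `df_small` and a temporary folder (both assumptions for illustration), so it does not touch the files used in the exercises.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_small = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # hypothetical data

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index = Path(tmp_dir) / 'with_index.csv'
    without_index = Path(tmp_dir) / 'without_index.csv'

    # Default: the row labels are written as an unnamed first column
    df_small.to_csv(with_index)
    # index=False: the row labels are left out of the file
    df_small.to_csv(without_index, index=False)

    # Read both files back and compare their column names
    cols_with_index = list(pd.read_csv(with_index).columns)
    cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```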

### Exercise-02: Parents and siblings
Sometimes we need to save/load data not in the current directory. Run the code below to save the same test data to two other locations. After you have run the code, identify *where* the files were saved.

In [None]:
df_test.to_csv('../test_parent.csv')
df_test.to_csv('../data/test_sibling.csv')

Before we move on, let's clean up the mess you made! Run the code below to remove the test `.csv` files we just created.

In [None]:
!rm ./test.csv
!rm ../test_parent.csv
!rm ../data/test_sibling.csv

(Note that the above code uses `!` to run terminal commands from inside a notebook!) Cool, no?
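
The `!rm` commands rely on the notebook running on top of a Unix-like shell. If that is not the case, files can also be deleted from Python itself. The following is a minimal sketch using the standard-library `pathlib` module; the temporary folder and file are made up for illustration so that nothing real is deleted.

```python
import tempfile
from pathlib import Path

# Hypothetical file in a temporary folder, so nothing real is deleted
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / 'test.csv'
csv_path.write_text('A,B\n1,2\n')

# Delete the file; missing_ok=True (Python 3.8+) avoids an error if it is already gone
csv_path.unlink(missing_ok=True)
csv_path.unlink(missing_ok=True)  # second call does nothing

# Remove the now-empty temporary folder
tmp_dir.rmdir()

print(csv_path.exists())
```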

## Loading data
Ok, let's get started with some *real* data. To work with real data, we first need to load it. If the data is in a `.csv` file then we can load the data simply using the Pandas `read_csv` function. Run the code below to load the `listings_sample.csv` data into a Pandas `DataFrame`.

In [5]:
import pandas as pd
df_listings = pd.read_csv('../data/listings_sample.csv', 
                usecols=['host_id','id','name','room_type','price'])
df_listings.head()

| | host_id | id | name | room_type | price |
|---|---|---|---|---|---|
| 0 | 43039 | 11551 | Arty and Bright London Apartment in Zone 2 | Entire home/apt | $110.00 |
| 1 | 54730 | 13913 | Holiday London DB Room Let-on going | Private room | $40.00 |
| 2 | 60302 | 15400 | Bright Chelsea Apartment. Chelsea! | Entire home/apt | $75.00 |
| 3 | 67564 | 17402 | Superb 3-Bed/2 Bath & Wifi: Trendy W1 | Entire home/apt | $307.00 |
| 4 | 103583 | 25123 | Clean big Room in London (Room 1) | Private room | $29.00 |


The data you just loaded relates to real listings on [Airbnb](https://www.airbnb.com), sampled from a dataset published on the [Inside Airbnb](http://insideairbnb.com/about.html) website. Note how the `usecols` argument was used to specify which columns of the data to load.
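
To see the effect of `usecols` in isolation, here is a small sketch that reads CSV text from memory rather than from a file. The CSV content and the names `csv_text` and `df_subset` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content with four columns
csv_text = (
    "id,name,price,extra\n"
    "1,Flat,$10.00,x\n"
    "2,Room,$20.00,y\n"
)

# Only the columns listed in usecols are loaded; the others are skipped
df_subset = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'price'])

print(list(df_subset.columns))
```

Loading only the columns you need can save time and memory when the file is large.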

### Exercise-03: Load host data
Now it's time to try to load some data yourself. Using the code above as a guide, load the `hosts_sample.csv` file into a `DataFrame` named `df_hosts` with columns `['id','host_name','host_since']`.

In [6]:
# (SOLUTION) Delete in public branch
df_hosts = pd.read_csv('../data/hosts_sample.csv', usecols=['id','host_name','host_since'])
df_hosts.head()

| | id | host_name | host_since |
|---|---|---|---|
| 0 | 43039.0 | Adriano | 2009-10-03 |
| 1 | 54730.0 | Alina | 2009-11-16 |
| 2 | 60302.0 | Philippa | 2009-12-05 |
| 3 | 67564.0 | Liz | 2010-01-04 |
| 4 | 103583.0 | Grace | 2010-04-05 |


## Preparing data
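Often the data we load for analysis comes with values we cannot immediately work with, and we need to remove or reformat those values before we can do the analysis we want. For example, a column may contain missing values. Run the code below to count the missing values in the `price` column of `df_listings`.

In [None]:
df_listings['price'].isna().sum()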
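
Counting missing values is usually the first step; the next is deciding what to do with them. The sketch below uses a made-up `df_missing` (an assumption for illustration) to count missing values per column and to drop the rows where `price` is missing. Dropping rows is only one option and may not suit every analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing price and one missing name
df_missing = pd.DataFrame({
    'name': ['Flat', 'Room', None],
    'price': ['$10.00', np.nan, '$30.00'],
})

# Number of missing values in a single column
n_missing = df_missing['price'].isna().sum()

# Number of missing values in every column
missing_per_column = df_missing.isna().sum()

# Keep only the rows where price is present
df_complete = df_missing.dropna(subset=['price'])

print(n_missing)
print(missing_per_column)
print(len(df_complete))
```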

As another example, the `price` column in `df_listings` contains [strings](https://www.w3schools.com/python/python_strings.asp) with currency symbols (e.g. `$`) and thousands separators (e.g. the commas in `1,000,000`). We need to convert these strings to numerical (float) values before we can use the `price` column in an analysis.

In [None]:
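def format_price(price):
    # Strip the currency symbol and thousands separators, then convert to float
    return float(price.replace('$', '').replace(',', ''))

# Call format_price on every value in the price column
df_listings['price_$'] = df_listings['price'].apply(format_price)
df_listings[['price','price_$']].head()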

| | price | price_$ |
|---|---|---|
| 0 | $110.00 | 110.0 |
| 1 | $40.00 | 40.0 |
| 2 | $75.00 | 75.0 |
| 3 | $307.00 | 307.0 |
| 4 | $29.00 | 29.0 |


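Note how we defined the `format_price` function and used the `apply` method to call it on every value in the column. The same pattern can be used to extract the year from the `host_since` column of `df_hosts`.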
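
Pandas also offers string methods that work on a whole column at once, through the `.str` accessor. As an alternative sketch to the `apply` approach, the code below cleans a made-up `prices` column (an assumption for illustration) without defining a separate function.

```python
import pandas as pd

# Hypothetical price strings, including one with a thousands separator
prices = pd.Series(['$110.00', '$1,250.00', '$29.00'])

# regex=False treats '$' and ',' as plain characters rather than patterns
prices_float = (
    prices.str.replace('$', '', regex=False)
          .str.replace(',', '', regex=False)
          .astype(float)
)

print(prices_float.tolist())
```

Which style to use is largely a matter of taste; `apply` with a named function can be easier to read when the formatting logic grows.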

In [None]:
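def get_year(host_since):
    # Missing dates are loaded as NaN, which has no split method, so pass them through
    if pd.isna(host_since):
        return host_since
    # Dates look like 'YYYY-MM-DD', so the year is the part before the first '-'
    return host_since.split('-')[0]

df_hosts['host_since_year'] = df_hosts['host_since'].apply(get_year)
df_hosts.head()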
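
Missing `host_since` values need care when working with dates. As an alternative sketch, `pd.to_datetime` converts a whole column of date strings at once and turns missing values into `NaT` ("not a time") instead of failing. The `host_since` series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dates in the same 'YYYY-MM-DD' layout, with one missing value
host_since = pd.Series(['2009-10-03', np.nan, '2010-04-05'])

# Convert the strings to dates; the missing value becomes NaT
host_since_dates = pd.to_datetime(host_since, format='%Y-%m-%d')

# The .dt accessor exposes date parts such as the year; NaT gives a missing year
years = host_since_dates.dt.year

print(years)
```

Unlike splitting the string, this gives numeric years, and the same `.dt` accessor also provides `month` and `day`.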

## Querying data
Pandas can be used to retrieve slices of data, much as you would query a database. Let's look at some slices of the data.

### Example 01-04: Query hosts by date
A common way to query data is by date: old data, recent data, or data from a particular day.

In [None]:
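# Keep the hosts who joined on or after 1 January 2010 (missing dates never match)
df_hosts[pd.to_datetime(df_hosts['host_since']) >= '2010-01-01']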
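
A query in Pandas is typically written as a boolean mask: a comparison produces a `Series` of `True`/`False` values, and indexing the `DataFrame` with that mask keeps only the matching rows. The sketch below applies this idea to prices, using a small made-up `df_demo` (an assumption for illustration) with the same `room_type` and `price_$` column names as above.

```python
import pandas as pd

# Hypothetical listings with numeric prices
df_demo = pd.DataFrame({
    'name': ['Flat', 'Room', 'Loft'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt'],
    'price_$': [110.0, 40.0, 75.0],
})

# The comparison gives one True/False value per row
is_cheap = df_demo['price_$'] < 100

# Indexing with the mask keeps only the rows marked True
df_cheap = df_demo[is_cheap]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
df_cheap_private = df_demo[(df_demo['price_$'] < 100) & (df_demo['room_type'] == 'Private room')]

print(df_cheap)
print(df_cheap_private)
```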

### Query listing by number of bedrooms
Easy... 
