<p align="center">
<img src="https://github.com/adelnehme/advanced-python/blob/master/assets/hsbc_datacamp.png?raw=True" alt = "DataCamp icon" width="50%">
</p>
<br>

## **Advanced Python Learning Session**


#### **Learning Objectives**

- Understand the value of automating Python code using functions and best practices when authoring functions
- Create a set of Python functions that automate a data cleaning workflow
- Learn the use-cases of package authoring, object-oriented programming and how it enables data democratization
- Understand the value of git, version control, and how it enables collaboration on data projects

#### **The Dataset**

The dataset to be used in this webinar is a CSV file named `airbnb.csv`, which contains data on airbnb listings in the state of New York. It contains the following columns:

- `listing_id`: The unique identifier for a listing
- `description`: The description used on the listing
- `host_id`: Unique identifier for a host
- `host_name`: Name of host
- `neighbourhood_full`: Name of boroughs and neighbourhoods
- `coordinates`: Coordinates of listing _(latitude, longitude)_
- `Listing added`: Date of added listing
- `room_type`: Type of room 
- `rating`: Rating from 0 to 5.
- `price`: Price per night for listing
- `number_of_reviews`: Amount of reviews received 
- `last_review`: Date of last review
- `reviews_per_month`: Number of reviews per month
- `availability_365`: Number of days available per year
- `Number of stays`: Total number of stays thus far


## **Introduction to Functions and Methods in Python**

### **Functions**

A simple definition of functions in Python is that they are a piece of **re-usable code** we can use to solve a **particluar task**. 

For example, `pd.read_csv()` that we just used is a `pandas` function that allows us to read csv files. 

There are built-in Python function such as:

```
# The max() function lets you find the maximum value in a list
my_list = [1,2,3,4,5]
max(my_list)

5
```

```
# The type() function that lets you determine the type of a variable
my_word = "Hello World!"
type(my_word)

str
```

### **Methods**

A method in Python is a function that belongs to an object. In Python, objects can be integers, strings, lists, DataFrames and more. For example, the `airbnb` DataFrame is a Pandas DataFrame object

```
# Get the type of airbnb
airbnb = pd.read_csv("aibnb.csv")
type(airbnb)

pandas.core.frame.DataFrame
```

Every object in Python has built-in methods that you can use to access functionality in the object — for example:

```
# Use .capitalize() to capitalize strings
my_word = "hello"
my_word.capitalize()

"Hello"
```

```
# Use .head() to print first 5 rows of a DataFrame
airbnb.head()
```

### **Creating your own functions**

To create your own function, you can use the following, illustrated with an example of a function named `double()` that doubles any value and returns the output

In [14]:
# Define a function named double
def double(value):
 
  # Create a new value that doubles given value
  new_value = value * 2
 
  # Return new value
  return new_value

### **Adding docstrings**

Docstrings are incredible tools for making functions shareable across your teams. If you want to understand someone else's code, or explain your code whenever someone accesses it, docstrings are your friend. We can add docstrings by using triple-quotes as the following, and ideally a docstring should cover the following:

- Description of the function and what it does
- Decscription of arguments if any *(if applicable)*
- Descritption of the return values
- Descriptions of errors raised *(if applicable)*
- Optional notes *(if applicable)*

In [16]:
# Define a function named double with docstrings
def double(value):
  """
  Returns the double of a given numeric value

  Arguments
  ---------
  value: The value to double

  Returns
  -------
  The doubled value
  """
 
  # Create a new value that doubles given value
  new_value = value * 2
 
  # Return new value
  return new_value

In [None]:
# You can retrieve a docstring from any function using
print(double.__doc__)



---

<center><h1> Q&A 1</h1> </center>

---





## **Getting Started**

In [23]:
# Import libraries
import pandas as pd

*To import a CSV file into* `pandas` , *we use*  `data = pd.read_csv(file_path)` *check out this [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for importing other data types*


In [33]:
# Read in the dataset
airbnb = pd.read_csv('https://github.com/adelnehme/advanced-python/blob/master/data/airbnb.csv?raw=true')

In [28]:
# Inspect the first five rows of airbnb with .head()
airbnb.head()

Unnamed: 0,listing_id,5_stars,availability_365,borough,coordinates,host_id,host_name,is_rated,last_review,name,neighbourhood,number_of_reviews,number_of_stays,reviews_per_month,room_type,price,rating,listing_added
0,3831,0.757366,194,Brooklyn,"(40.68514, -73.95976)",4869,LisaRoxanne,1,2019-07-05,Cozy Entire Floor of Brownstone,Clinton Hill,270,324.0,4.64,Entire home/apt,89.0,3.273935,2018-12-30
1,6848,0.789743,46,Brooklyn,"(40.70837, -73.95352)",15991,Allen & Irina,1,2019-06-29,Only 2 stops to Manhattan studio,Williamsburg,148,177.6,1.2,Entire home/apt,140.0,3.49576,2018-12-24
2,7322,0.669873,12,Manhattan,"(40.74192, -73.99501)",18946,Doti,1,2019-07-01,Chelsea Perfect,Chelsea,260,312.0,2.12,Private room,140.0,4.389051,2018-12-26
3,7726,0.640251,21,Brooklyn,"(40.67592, -73.94694)",20950,Adam And Charity,1,2019-06-22,Hip Historic Brownstone Apartment with Backyard,Crown Heights,53,63.6,4.44,Entire home/apt,99.0,3.305382,2018-12-17
4,12303,0.918593,311,Brooklyn,"(40.69673, -73.97584)",47618,Yolande,1,2018-09-30,1bdr w private bath. in lofty apt,Fort Greene,25,30.0,0.23,Private room,120.0,4.568745,2018-03-27


**Problem 1**: We need to split the `coordinates` column into 2 columns, one for lattitude and one for longitude

In [29]:
# Use .info() to detect missing values
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   listing_id         9999 non-null   int64  
 1   5_stars            9999 non-null   float64
 2   availability_365   9999 non-null   int64  
 3   borough            9999 non-null   object 
 4   coordinates        9999 non-null   object 
 5   host_id            9999 non-null   int64  
 6   host_name          9999 non-null   object 
 7   is_rated           9999 non-null   int64  
 8   last_review        9999 non-null   object 
 9   name               9999 non-null   object 
 10  neighbourhood      9999 non-null   object 
 11  number_of_reviews  9999 non-null   int64  
 12  number_of_stays    9999 non-null   float64
 13  reviews_per_month  9999 non-null   float64
 14  room_type          9999 non-null   object 
 15  price              9761 non-null   float64
 16  rating             9999 

**Problem 2**: We need to deal with the msising values of the `price` column — by replacing each missing value with the median price of a room in its room type.

In [30]:
# Check out unique values of room_type
airbnb['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room',
       '   Shared room      ', 'Private', 'home', 'PRIVATE ROOM'],
      dtype=object)

**Problem 3**: We need to collapse different values of `room_type` so that the unique values are

```
['Private Room', 'Entire place', 'Shared room']
```



---

<center><h1> Q&A 2</h1> </center>

---





## **Automating data cleaning with functions** 

##### **Task 1:** Replace `coordinates` with `latitude` and `longitude` columns

When this task is performed functionally, what is usually done is the following:

```
# Remove parenthesis "(" and ")" from coordinates using .str.strip()
airbnb['coordinates'] = airbnb['coordinates'].str.strip("(")
airbnb['coordinates'] = airbnb['coordinates'].str.strip(")")

# Create new lat_long object hosting split values using .str.split(',', expand = True)
lat_long = airbnb['coordinates'].str.split(",", expand = True)
print(lat_long.head())

          0           1
0  40.68514   -73.95976
1  40.70837   -73.95352
2  40.74192   -73.99501
3  40.67592   -73.94694
4  40.69673   -73.97584

# Assign new columns in airbnb
airbnb['latitude'] = lat_long[0]
airbnb['longitude'] = lat_long[1]

# Convert them to float
airbnb['latitude'] = airbnb['latitude'].astype(float)
airbnb['longitude'] = airbnb['longitude'].astype(float)
```

In [38]:
# Create a new function that splits coordinates
def split_coordinates(column):
  """
  A function that splits a coordinates in (lat, long) form
  to latitude and longitude 

  Arguments
  ----------
  column in a DataFrame

  Returns
  -------
  Latitude and Longitude series as floats
  """
  
  # Remove parenthesis "(" and ")"
  column = column.str.strip('(')
  column = column.str.strip(')')

  # Split coordinates into lat_long and convert to float
  lat_long = column.str.split(",", expand = True)
  lat_long[0] = lat_long[0].astype(float)
  lat_long[1] = lat_long[1].astype(float)

  # Return lattitude and longitude separately
  return lat_long[0], lat_long[1]

In [None]:
# Create latitude and longitude columns
airbnb['latitude'], airbnb['longitude'] = split_coordinates(airbnb['coordinates'])

# Print header
airbnb.head()



---

<center><h1> Q&A 3</h1> </center>

---





**Task 2**: We need to collapse `room_type` into correct categories

In [41]:
# Check out current unique values
airbnb['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room',
       '   Shared room      ', 'Private', 'home', 'PRIVATE ROOM'],
      dtype=object)

When this task is performed functionally, what is usually done is the following:

```
# Deal with capitalized values
airbnb['room_type'] = airbnb['room_type'].str.lower()

# Deal with trailing spaces
airbnb['room_type'] = airbnb['room_type'].str.strip()
airbnb['room_type'].unique()

array(['private room', 'entire home/apt', 'private', 'shared room',
       'home'], dtype=object)

# Map values
mappings = {'private room': 'Private Room', 
            'private': 'Private Room',
            'entire home/apt': 'Entire place',
            'shared room': 'Shared room',
            'home': 'Entire place'}

# Replace values and collapse data
airbnb['room_type'] = airbnb['room_type'].replace(mappings)
airbnb['room_type'].unique()


In [47]:
# Create a new function that cleans and remaps categorical text columns
def remap_text_columns(column, mapping):
  """
  A function that cleans and remaps categorical text columns

  Arguments
  ----------
  column in a DataFrame
  mapping wished to be applied on categorical values

  Returns
  -------
  Updated column
  """

  # Deal with capitalized values and trailing spaces
  column = column.str.lower()
  column = column.str.strip()

  # Perform mapping and return
  column = column.replace(mapping)
  
  return column

In [46]:
# Clean room_type column
mappings = {'private room': 'Private Room', 
            'private': 'Private Room',
            'entire home/apt': 'Entire place',
            'shared room': 'Shared room',
            'home': 'Entire place'}

# Clean room_type column
airbnb['room_type'] = remap_text_columns(airbnb['room_type'], mappings)

# Check out unique values
airbnb['room_type'].unique()

array(['Entire place', 'Private Room', 'Shared room'], dtype=object)



---

<center><h1> Q&A 4</h1> </center>

---





**Task 3**: Deal with missing data in `price` column by imputing missing values from a given room type with the median price of said room type

In [None]:
# Get median price per room_type
airbnb.groupby('room_type').median()['price']

room_type
Entire place    163.0
Private Room     70.0
Shared room      50.0
Name: price, dtype: float64

In [None]:
# Impute price based on conditions
airbnb.loc[(airbnb['price'].isna()) & (airbnb['room_type'] == 'Entire place'), 'price'] = 163.0
airbnb.loc[(airbnb['price'].isna()) & (airbnb['room_type'] == 'Private Room'), 'price'] = 70.0
airbnb.loc[(airbnb['price'].isna()) & (airbnb['room_type'] == 'Shared Room'), 'price'] = 50.0

In [None]:
# Confirm price has been imputed
airbnb.isna().sum()

listing_id              0
name                    5
host_id                 0
host_name               2
room_type               0
price                   7
number_of_reviews       0
last_review          2075
reviews_per_month       0
availability_365        0
rating               2075
number_of_stays         0
5_stars                 0
listing_added           0
latitude                0
longitude               0
borough                 0
neighbourhood           0
is_rated                0
dtype: int64

### What's still to be done?

Albeit we've done a significant amount of data cleaning tasks, there are still a couple of problems we have yet to diagnose. When cleaning data, we need to consider:

- Values that do not make any sense *(for example: are there values of `last_review` that older than `listing_added`? Are there listings in the future?*)
- Presence of duplicates values - and how to deal with them?

##### **Task 8:** Do we have consistent date data?

In [None]:
# Doing some sanity checks on date data
today = dt.date.today()

In [None]:
# Are there reviews in the future?
airbnb[airbnb['last_review'].dt.date > today]

Unnamed: 0,listing_id,name,host_id,host_name,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added,latitude,longitude,borough,neighbourhood,is_rated


In [None]:
# Are there listings in the future?
airbnb[airbnb['listing_added'].dt.date > today]

Unnamed: 0,listing_id,name,host_id,host_name,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added,latitude,longitude,borough,neighbourhood,is_rated
4,22986519,Bedroom on the lively Lower East Side,154262349,Brooke,Private Room,160.0,23,2019-06-12,2.29,102,3.822591,27.6,0.649383,2020-10-23,40.71884,-73.98354,Manhattan,Lower East Side,1
124,28659894,Private bedroom in prime Bushwick! Near Trains!!!,216235179,Nina,Private Room,55.0,4,2019-04-12,0.58,358,4.916252,4.8,0.703117,2020-08-23,40.69988,-73.92072,Brooklyn,Bushwick,1
511,33619855,Modern & Spacious in trendy Crown Heights,253354074,Yehudis,Entire place,150.0,6,2019-05-27,2.5,148,3.462432,7.2,0.610929,2020-10-07,40.66387,-73.9384,Brooklyn,Crown Heights,1
521,25317793,Awesome Cozy Room in The Heart of Sunnyside!,136406167,Kara,Private Room,65.0,22,2019-06-11,1.63,131,4.442485,26.4,0.722388,2020-10-22,40.7409,-73.92696,Queens,Sunnyside,1


In [None]:
# Drop these rows since they are only 4 rows
airbnb = airbnb[~(airbnb['listing_added'].dt.date > today)]

In [None]:
# Are there any listings with listing_added > last_review
inconsistent_dates = airbnb[airbnb['listing_added'].dt.date > airbnb['last_review'].dt.date]
inconsistent_dates

Unnamed: 0,listing_id,name,host_id,host_name,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added,latitude,longitude,borough,neighbourhood,is_rated
50,20783900,Marvelous Manhattan Marble Hill Private Suites,148960265,Randy,Private Room,93.0,7,2018-10-06,0.32,0,4.868036,8.4,0.609263,2020-02-17,40.87618,-73.91266,Manhattan,Marble Hill,1
60,1908852,Oversized Studio By Columbus Circle,684629,Alana,Entire place,189.0,7,2016-05-06,0.13,0,4.841204,8.4,0.725995,2017-09-17,40.7706,-73.98919,Manhattan,Upper West Side,1


In [None]:
# Drop these rows since they are only 2 rows
airbnb.drop(inconsistent_dates.index, inplace = True)

##### **Task 9:** Let's deal with duplicate data


There are two notable types of duplicate data:

- Identical duplicate data across all columns
- Identical duplicate data cross most or some columns

To diagnose, and deal with duplicate data, we will be using the following methods and functions:

- `.duplicated(subset = , keep = )`
  - `subset` lets us pick one or more columns with duplicate values.
  - `keep` returns lets us return all instances of duplicate values.
- `.drop_duplicates(subset = , keep = )`
  

In [None]:
# Print the header of the DataFrame again
airbnb.head()

Unnamed: 0,listing_id,name,host_id,host_name,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added,latitude,longitude,borough,neighbourhood,is_rated
0,13740704,"Cozy,budget friendly, cable inc, private entra...",20583125,Michel,Private Room,45.0,10,2018-12-12,0.7,85,4.100954,12.0,0.609432,2018-06-08,40.63222,-73.93398,Brooklyn,Flatlands,1
1,22005115,Two floor apartment near Central Park,82746113,Cecilia,Entire place,135.0,1,2019-06-30,1.0,145,3.3676,1.2,0.746135,2018-12-25,40.78761,-73.96862,Manhattan,Upper West Side,1
2,21667615,Beautiful 1BR in Brooklyn Heights,78251,Leslie,Entire place,150.0,0,NaT,0.0,65,,0.0,0.0,2018-08-15,40.7007,-73.99517,Brooklyn,Brooklyn Heights,0
3,6425850,"Spacious, charming studio",32715865,Yelena,Entire place,86.0,5,2017-09-23,0.13,0,4.763203,6.0,0.769947,2017-03-20,40.79169,-73.97498,Manhattan,Upper West Side,1
5,271954,Beautiful brownstone apartment,1423798,Aj,Entire place,150.0,203,2019-06-20,2.22,300,4.478396,243.6,0.7435,2018-12-15,40.73388,-73.99452,Manhattan,Greenwich Village,1


In [None]:
# Find duplicates
duplicates = airbnb.duplicated(subset = 'listing_id', keep = False)
print(duplicates)

0        False
1        False
2        False
3        False
5        False
         ...  
10014    False
10015    False
10016    False
10017    False
10018    False
Length: 10010, dtype: bool


In [None]:
# Find duplicates
airbnb[duplicates].sort_values('listing_id')

Unnamed: 0,listing_id,name,host_id,host_name,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added,latitude,longitude,borough,neighbourhood,is_rated
1145,253806,Loft Suite @ The Box House Hotel,417504,The Box House Hotel,Entire place,199.0,43,2019-07-02,0.47,60,4.620238,51.6,0.861086,2018-12-27,40.73652,-73.95236,Brooklyn,Greenpoint,1
6562,253806,Loft Suite @ The Box House Hotel,417504,The Box House Hotel,Entire place,199.0,43,2019-07-02,0.47,60,4.620238,51.6,0.861086,2018-12-27,40.73652,-73.95236,Brooklyn,Greenpoint,1
8699,2044392,The heart of Williamsburg 2 bedroom,620218,Sarah,Entire place,245.0,0,NaT,0.0,0,,0.0,0.0,2018-08-09,40.71257,-73.96149,Brooklyn,Williamsburg,0
5761,2044392,The heart of Williamsburg 2 bedroom,620218,Sarah,Entire place,250.0,0,NaT,0.0,0,,0.0,0.0,2018-05-24,40.71257,-73.96149,Brooklyn,Williamsburg,0
4187,4244242,Best Bedroom in Bedstuy/Bushwick. Ensuite bath...,22023014,BrooklynSleeps,Private Room,73.0,110,2019-06-23,1.96,323,4.962314,132.0,0.809882,2018-12-18,40.69496,-73.93949,Brooklyn,Bedford-Stuyvesant,1
2871,4244242,Best Bedroom in Bedstuy/Bushwick. Ensuite bath...,22023014,BrooklynSleeps,Private Room,70.0,110,2019-06-23,1.96,323,4.962314,132.0,0.809882,2018-12-18,40.69496,-73.93949,Brooklyn,Bedford-Stuyvesant,1
77,7319856,450ft Square Studio in Gramercy NY,11773680,Adam,Entire place,289.0,4,2016-05-22,0.09,225,3.903764,4.8,0.756381,2015-11-17,40.73813,-73.98098,Manhattan,Kips Bay,1
2255,7319856,450ft Square Studio in Gramercy NY,11773680,Adam,Entire place,280.0,4,2016-05-22,0.09,225,3.903764,4.8,0.756381,2015-11-17,40.73813,-73.98098,Manhattan,Kips Bay,1
555,9078222,"Prospect Park 3 bdrm, Sleeps 8 (#2)",47219962,Babajide,Entire place,154.0,123,2019-07-01,2.74,263,3.466881,147.6,0.738191,2018-12-26,40.66086,-73.96159,Brooklyn,Prospect-Lefferts Gardens,1
7933,9078222,"Prospect Park 3 bdrm, Sleeps 8 (#2)",47219962,Babajide,Entire place,150.0,123,2019-07-01,2.74,263,3.466881,147.6,0.738191,2018-12-26,40.66086,-73.96159,Brooklyn,Prospect-Lefferts Gardens,1


In [None]:
# Remove identical duplicates
airbnb = airbnb.drop_duplicates()

In [None]:
# Find non-identical duplicates
duplicates = airbnb.duplicated(subset = 'listing_id', keep = False)

In [None]:
# Show all duplicates
airbnb[duplicates].sort_values('listing_id')

Unnamed: 0,listing_id,name,host_id,host_name,room_type,price,number_of_reviews,last_review,reviews_per_month,availability_365,rating,number_of_stays,5_stars,listing_added,latitude,longitude,borough,neighbourhood,is_rated
5761,2044392,The heart of Williamsburg 2 bedroom,620218,Sarah,Entire place,250.0,0,NaT,0.0,0,,0.0,0.0,2018-05-24,40.71257,-73.96149,Brooklyn,Williamsburg,0
8699,2044392,The heart of Williamsburg 2 bedroom,620218,Sarah,Entire place,245.0,0,NaT,0.0,0,,0.0,0.0,2018-08-09,40.71257,-73.96149,Brooklyn,Williamsburg,0
2871,4244242,Best Bedroom in Bedstuy/Bushwick. Ensuite bath...,22023014,BrooklynSleeps,Private Room,70.0,110,2019-06-23,1.96,323,4.962314,132.0,0.809882,2018-12-18,40.69496,-73.93949,Brooklyn,Bedford-Stuyvesant,1
4187,4244242,Best Bedroom in Bedstuy/Bushwick. Ensuite bath...,22023014,BrooklynSleeps,Private Room,73.0,110,2019-06-23,1.96,323,4.962314,132.0,0.809882,2018-12-18,40.69496,-73.93949,Brooklyn,Bedford-Stuyvesant,1
77,7319856,450ft Square Studio in Gramercy NY,11773680,Adam,Entire place,289.0,4,2016-05-22,0.09,225,3.903764,4.8,0.756381,2015-11-17,40.73813,-73.98098,Manhattan,Kips Bay,1
2255,7319856,450ft Square Studio in Gramercy NY,11773680,Adam,Entire place,280.0,4,2016-05-22,0.09,225,3.903764,4.8,0.756381,2015-11-17,40.73813,-73.98098,Manhattan,Kips Bay,1
555,9078222,"Prospect Park 3 bdrm, Sleeps 8 (#2)",47219962,Babajide,Entire place,154.0,123,2019-07-01,2.74,263,3.466881,147.6,0.738191,2018-12-26,40.66086,-73.96159,Brooklyn,Prospect-Lefferts Gardens,1
7933,9078222,"Prospect Park 3 bdrm, Sleeps 8 (#2)",47219962,Babajide,Entire place,150.0,123,2019-07-01,2.74,263,3.466881,147.6,0.738191,2018-12-26,40.66086,-73.96159,Brooklyn,Prospect-Lefferts Gardens,1
1481,15027024,Newly renovated 1bd on lively & historic St Marks,8344620,Ethan,Entire place,180.0,10,2018-12-31,0.3,0,3.969729,12.0,0.772513,2018-06-27,40.72693,-73.98385,Manhattan,East Village,1
3430,15027024,Newly renovated 1bd on lively & historic St Marks,8344620,Ethan,Entire place,180.0,10,2018-12-31,0.3,0,3.869729,12.0,0.772513,2018-06-27,40.72693,-73.98385,Manhattan,East Village,1


To treat identical duplicates across some columns, we will chain the `.groupby()` and `.agg()` methods where we group by the column used to find duplicates (`listing_id`) and aggregate across statistical measures for `price`, `rating` and `list_added`. The `.agg()` method takes in a dictionary with each column's aggregation method - we will use the following aggregations:

- `mean` for `price` and `rating` columns
- `max` for `listing_added` column
- `first` for all remaining column

*A note on dictionary comprehensions:*

Dictionaries are useful data structures in Python with the following format
`my_dictionary = {key: value}` where a `key` is mapped to a `value` and whose `value` can be returned with `my_dictionary[key]` - dictionary comprehensions allow us to programmatically create dicitonaries using the structure:

```
{x: x*2 for x in [1,2,3,4,5]} 
{1:2, 2:4, 3:6, 4:8, 5:10}
```

In [None]:
# Get column names from airbnb
column_names = airbnb.columns
column_names

Index(['listing_id', '5_stars', 'availability_365', 'borough', 'host_id',
       'host_name', 'is_rated', 'last_review', 'latitude', 'longitude', 'name',
       'neighbourhood', 'number_of_reviews', 'number_of_stays',
       'reviews_per_month', 'room_type', 'price', 'rating', 'listing_added'],
      dtype='object')

In [None]:
# Create dictionary comprehension with 'first' as value for all columns not being aggregated
aggregations = {column_name:'first' for column_name in column_names.difference(['listing_id', 'listing_added', 'rating', 'price'])}
aggregations['price'] = 'mean'
aggregations['rating'] = 'mean'
aggregations['listing_added'] = 'max'
aggregations

{'5_stars': 'first',
 'availability_365': 'first',
 'borough': 'first',
 'host_id': 'first',
 'host_name': 'first',
 'is_rated': 'first',
 'last_review': 'first',
 'latitude': 'first',
 'listing_added': 'max',
 'longitude': 'first',
 'name': 'first',
 'neighbourhood': 'first',
 'number_of_reviews': 'first',
 'number_of_stays': 'first',
 'price': 'mean',
 'rating': 'mean',
 'reviews_per_month': 'first',
 'room_type': 'first'}

In [None]:
# Remove non-identical duplicates
airbnb = airbnb.groupby('listing_id').agg(aggregations).reset_index()

In [None]:
# Make sure no duplication happened
airbnb[airbnb.duplicated('listing_id', keep = False)]

Unnamed: 0,listing_id,5_stars,availability_365,borough,host_id,host_name,is_rated,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,number_of_stays,reviews_per_month,room_type,price,rating,listing_added


In [None]:
# Print header of DataFrame
airbnb.head()

Unnamed: 0,listing_id,5_stars,availability_365,borough,host_id,host_name,is_rated,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,number_of_stays,reviews_per_month,room_type,price,rating,listing_added
0,3831,0.757366,194,Brooklyn,4869,LisaRoxanne,1,2019-07-05,40.68514,-73.95976,Cozy Entire Floor of Brownstone,Clinton Hill,270,324.0,4.64,Entire place,89.0,3.273935,2018-12-30
1,6848,0.789743,46,Brooklyn,15991,Allen & Irina,1,2019-06-29,40.70837,-73.95352,Only 2 stops to Manhattan studio,Williamsburg,148,177.6,1.2,Entire place,140.0,3.49576,2018-12-24
2,7322,0.669873,12,Manhattan,18946,Doti,1,2019-07-01,40.74192,-73.99501,Chelsea Perfect,Chelsea,260,312.0,2.12,Private Room,140.0,4.389051,2018-12-26
3,7726,0.640251,21,Brooklyn,20950,Adam And Charity,1,2019-06-22,40.67592,-73.94694,Hip Historic Brownstone Apartment with Backyard,Crown Heights,53,63.6,4.44,Entire place,99.0,3.305382,2018-12-17
4,12303,0.918593,311,Brooklyn,47618,Yolande,1,2018-09-30,40.69673,-73.97584,1bdr w private bath. in lofty apt,Fort Greene,25,30.0,0.23,Private Room,120.0,4.568745,2018-03-27


## **Q&A**

### Take home question

Try to answer the following questions about the dataset:

- What is the average price of listings by borough? Visualize your results with a bar plot!
- What is the average availability in days of listings by borough? Visualize your results with a bar plot!
- What is the median price per room type in each borough? Visualize your results with a bar plot!
- Visualize the number of listings over time.

**Functions that should/could be used:**
- `.groupby()` and `.agg(})`
- `sns.barplot(x = , y = , hue = , data = )`
- `sns.lineplot(x = , y = , data = )`
- `.dt.strftime()` for extracting specific dates from a `datetime` column

**Bonus points if:**
- You finish more than one question

**Submission details:**
- Share with us a code snippet with your output on LinkedIn, Twitter or Facebook
- Tag us on `@DataCamp` with the hashtag `#datacamplive`
