<p align="center">
<img src="https://github.com/adelnehme/advanced-python/blob/master/assets/hsbc_datacamp.png?raw=True" alt = "DataCamp icon" width="50%">
</p>
<br>

## **Advanced Python Learning Session**


#### **Learning Objectives**

- Understand the value of automating Python code using functions and best practices when authoring functions
- Create a set of Python functions that automate a data cleaning workflow
- Learn the use-cases of package authoring, object-oriented programming and how it enables data democratization
- Understand the value of git, version control, and how it enables collaboration on data projects

#### **The Dataset**

The dataset used in this webinar is a CSV file named `airbnb.csv`, which contains data on Airbnb listings in the state of New York. It contains the following columns:

- `listing_id`: The unique identifier for a listing
- `description`: The description used on the listing
- `host_id`: Unique identifier for a host
- `host_name`: Name of host
- `neighbourhood_full`: Name of boroughs and neighbourhoods
- `coordinates`: Coordinates of listing _(latitude, longitude)_
- `Listing added`: Date of added listing
- `room_type`: Type of room 
- `rating`: Rating from 0 to 5
- `price`: Price per night for listing
- `number_of_reviews`: Number of reviews received
- `last_review`: Date of last review
- `reviews_per_month`: Number of reviews per month
- `availability_365`: Number of days available per year
- `Number of stays`: Total number of stays thus far


## **Introduction to Functions and Methods in Python**

### **Functions**

A simple definition of a function in Python is a piece of **re-usable code** we can use to solve a **particular task**.

For example, `pd.read_csv()` is a `pandas` function that allows us to read CSV files.

There are also built-in Python functions such as:

```
# The max() function lets you find the maximum value in a list
my_list = [1,2,3,4,5]
max(my_list)

5
```

```
# The type() function that lets you determine the type of a variable
my_word = "Hello World!"
type(my_word)

str
```

### **Methods**

A method in Python is a function that belongs to an object. In Python, objects can be integers, strings, lists, DataFrames and more. For example, the `airbnb` DataFrame is a `pandas` DataFrame object.

```
# Get the type of airbnb
import pandas as pd
airbnb = pd.read_csv("airbnb.csv")
type(airbnb)

pandas.core.frame.DataFrame
```

Every object in Python has built-in methods that give you access to its functionality. For example:

```
# Use .capitalize() to capitalize strings
my_word = "hello"
my_word.capitalize()

"Hello"
```

```
# Use .head() to print first 5 rows of a DataFrame
airbnb.head()
```
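
To make the distinction concrete, the short sketch below calls a built-in function and a method on the same object. The list `prices` is made up for illustration and is not part of the dataset.

```python
# Hypothetical list of nightly prices, used only for illustration
prices = [89.0, 140.0, 99.0]

# A function is called on its own and receives the object as an argument
number_of_prices = len(prices)

# A method is called on the object itself using dot notation
prices.append(120.0)

print(number_of_prices)
print(prices)
```

Here `len()` works on many kinds of objects, while `.append()` only exists because `prices` is a list.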

### **Creating your own functions**

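To create your own function, use the `def` keyword followed by the function name, its arguments in parentheses, and a colon. The indented lines that follow form the body of the function. The example below defines a function named `double()` that doubles any value and returns the result.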

In [92]:
# Define a function named double
def double(value):
 
  # Create a new value that doubles given value
  new_value = value * 2
 
  # Return new value
  return new_value

In [93]:
# Test it
double(4)

8

In [94]:
# Define a function that adds two numbers together
def add_values(value_a, value_b):
  
  # Create new value
  new_value = value_a + value_b

  # Return new value
  return new_value

In [96]:
# Test it
add_values(4, 5)

9

### **Adding docstrings**

Docstrings make functions easier to share across your teams. Whether you want to understand someone else's code or explain your own to whoever uses it, docstrings are your friend. A docstring is written between triple quotes directly below the function definition, and ideally it should cover the following:

- Description of the function and what it does
- Description of the arguments *(if applicable)*
- Description of the return values
- Description of errors raised *(if applicable)*
- Optional notes *(if applicable)*

In [None]:
# Define a function named double with docstrings
def double(value):
  """
  Returns the double of a given numeric value

  Arguments
  ---------
  value: The value to double

  Returns
  -------
  The doubled value
  """
 
  # Create a new value that doubles given value
  new_value = value * 2
 
  # Return new value
  return new_value

In [None]:
# You can retrieve a docstring from any function using
print(double.__doc__)
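
The docstring above covers the description, arguments and return value. A fuller sketch that also documents raised errors and notes might look like the following. The type check and the `Raises` and `Notes` sections are assumptions added for illustration and are not part of the original `double()`.

```python
# A variant of double() whose docstring covers all five points
def double(value):
  """
  Returns the double of a given numeric value

  Arguments
  ---------
  value: The numeric value (int or float) to double

  Returns
  -------
  The doubled value

  Raises
  ------
  TypeError: If value is not an int or a float

  Notes
  -----
  Strings are rejected on purpose, because "ab" * 2 would
  repeat the text instead of doubling a number
  """

  # Reject non-numeric input (illustrative check, not in the original function)
  if not isinstance(value, (int, float)):
    raise TypeError("value must be an int or a float")

  # Return the doubled value
  return value * 2

# help() prints the function signature together with its docstring
help(double)
```

Using `help(double)` is an alternative to printing `double.__doc__` and is what most people reach for in an interactive session.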



---

<center><h1> Q&A 1</h1> </center>

---





## **Getting Started**

In [None]:
# Import libraries
import pandas as pd

*To import a CSV file into* `pandas`, *we use* `data = pd.read_csv(file_path)`. *Check out this [documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html) for importing other data types.*


In [None]:
# Read in the dataset
airbnb = pd.read_csv('https://github.com/adelnehme/advanced-python/blob/master/data/airbnb.csv?raw=true')

In [None]:
# Inspect the first five rows of airbnb with .head()
airbnb.head()

Unnamed: 0,listing_id,5_stars,availability_365,borough,coordinates,host_id,host_name,is_rated,last_review,name,neighbourhood,number_of_reviews,number_of_stays,reviews_per_month,room_type,price,rating,listing_added
0,3831,0.757366,194,Brooklyn,"(40.68514, -73.95976)",4869,LisaRoxanne,1,2019-07-05,Cozy Entire Floor of Brownstone,Clinton Hill,270,324.0,4.64,Entire home/apt,89.0,3.273935,2018-12-30
1,6848,0.789743,46,Brooklyn,"(40.70837, -73.95352)",15991,Allen & Irina,1,2019-06-29,Only 2 stops to Manhattan studio,Williamsburg,148,177.6,1.2,Entire home/apt,140.0,3.49576,2018-12-24
2,7322,0.669873,12,Manhattan,"(40.74192, -73.99501)",18946,Doti,1,2019-07-01,Chelsea Perfect,Chelsea,260,312.0,2.12,Private room,140.0,4.389051,2018-12-26
3,7726,0.640251,21,Brooklyn,"(40.67592, -73.94694)",20950,Adam And Charity,1,2019-06-22,Hip Historic Brownstone Apartment with Backyard,Crown Heights,53,63.6,4.44,Entire home/apt,99.0,3.305382,2018-12-17
4,12303,0.918593,311,Brooklyn,"(40.69673, -73.97584)",47618,Yolande,1,2018-09-30,1bdr w private bath. in lofty apt,Fort Greene,25,30.0,0.23,Private room,120.0,4.568745,2018-03-27


**Problem 1**: We need to split the `coordinates` column into 2 columns, one for latitude and one for longitude.

In [None]:
# Use .info() to detect missing values
airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   listing_id         9999 non-null   int64  
 1   5_stars            9999 non-null   float64
 2   availability_365   9999 non-null   int64  
 3   borough            9999 non-null   object 
 4   coordinates        9999 non-null   object 
 5   host_id            9999 non-null   int64  
 6   host_name          9999 non-null   object 
 7   is_rated           9999 non-null   int64  
 8   last_review        9999 non-null   object 
 9   name               9999 non-null   object 
 10  neighbourhood      9999 non-null   object 
 11  number_of_reviews  9999 non-null   int64  
 12  number_of_stays    9999 non-null   float64
 13  reviews_per_month  9999 non-null   float64
 14  room_type          9999 non-null   object 
 15  price              9761 non-null   float64
 16  rating             9999 
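
In the output above, `price` has 9761 non-null entries out of 9999, so some prices are missing. As a complement to `.info()`, `.isna().sum()` counts the missing values per column directly. The small `sample` DataFrame below is made-up stand-in data so that the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for airbnb with one missing price
sample = pd.DataFrame({
  'listing_id': [3831, 6848, 7322],
  'price': [89.0, np.nan, 140.0]
})

# Count the missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

On the real data, the same call would be `airbnb.isna().sum()`.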

In [None]:
# Check out unique values of room_type
airbnb['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room',
       '   Shared room      ', 'Private', 'home', 'PRIVATE ROOM'],
      dtype=object)

**Problem 2**: We need to collapse different values of `room_type` so that the unique values are

```
['Private Room', 'Entire place', 'Shared room']
```



---

<center><h1> Q&A 2</h1> </center>

---





## **Automating data cleaning with functions** 

##### **Task 1:** Replace `coordinates` with `latitude` and `longitude` columns

When this task is performed step by step, without a reusable function, it usually looks like this:

```
# Remove parentheses "(" and ")" from coordinates using .str.strip()
airbnb['coordinates'] = airbnb['coordinates'].str.strip("(")
airbnb['coordinates'] = airbnb['coordinates'].str.strip(")")

# Create new lat_long object hosting split values using .str.split(',', expand = True)
lat_long = airbnb['coordinates'].str.split(",", expand = True)
print(lat_long.head())

          0           1
0  40.68514   -73.95976
1  40.70837   -73.95352
2  40.74192   -73.99501
3  40.67592   -73.94694
4  40.69673   -73.97584

# Assign new columns in airbnb
airbnb['latitude'] = lat_long[0]
airbnb['longitude'] = lat_long[1]

# Convert them to float
airbnb['latitude'] = airbnb['latitude'].astype(float)
airbnb['longitude'] = airbnb['longitude'].astype(float)
```

In [None]:
# Create a new function that splits coordinates
def split_coordinates(column):
  """
  Splits a coordinates column in "(lat, long)" form
  into latitude and longitude

  Arguments
  ----------
  column: A DataFrame column of coordinates stored as "(lat, long)" text

  Returns
  -------
  Latitude and Longitude series as floats
  """
  
  # Remove parentheses "(" and ")"
  updated_column = column.str.strip('(')
  updated_column = updated_column.str.strip(')')

  # Split coordinates into lat_long and convert to float
  lat_long = updated_column.str.split(",", expand = True)
  lat_long[0] = lat_long[0].astype(float)
  lat_long[1] = lat_long[1].astype(float)

  # Return latitude and longitude separately
  return lat_long[0], lat_long[1]

In [None]:
# Create latitude and longitude columns
airbnb['latitude'], airbnb['longitude'] = split_coordinates(airbnb['coordinates'])

# Print header
airbnb.head()

Unnamed: 0,listing_id,5_stars,availability_365,borough,coordinates,host_id,host_name,is_rated,last_review,name,neighbourhood,number_of_reviews,number_of_stays,reviews_per_month,room_type,price,rating,listing_added,latitude,longitude
0,3831,0.757366,194,Brooklyn,"(40.68514, -73.95976)",4869,LisaRoxanne,1,2019-07-05,Cozy Entire Floor of Brownstone,Clinton Hill,270,324.0,4.64,Entire home/apt,89.0,3.273935,2018-12-30,40.68514,-73.95976
1,6848,0.789743,46,Brooklyn,"(40.70837, -73.95352)",15991,Allen & Irina,1,2019-06-29,Only 2 stops to Manhattan studio,Williamsburg,148,177.6,1.2,Entire home/apt,140.0,3.49576,2018-12-24,40.70837,-73.95352
2,7322,0.669873,12,Manhattan,"(40.74192, -73.99501)",18946,Doti,1,2019-07-01,Chelsea Perfect,Chelsea,260,312.0,2.12,Private room,140.0,4.389051,2018-12-26,40.74192,-73.99501
3,7726,0.640251,21,Brooklyn,"(40.67592, -73.94694)",20950,Adam And Charity,1,2019-06-22,Hip Historic Brownstone Apartment with Backyard,Crown Heights,53,63.6,4.44,Entire home/apt,99.0,3.305382,2018-12-17,40.67592,-73.94694
4,12303,0.918593,311,Brooklyn,"(40.69673, -73.97584)",47618,Yolande,1,2018-09-30,1bdr w private bath. in lofty apt,Fort Greene,25,30.0,0.23,Private room,120.0,4.568745,2018-03-27,40.69673,-73.97584
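
As a quick sanity check, a function like `split_coordinates()` can be tried on a few hand-written values before running it on the full dataset. The sketch below restates the function in a compact form (stripping both parentheses in a single `.str.strip('()')` call, which is a variation on the version above) and applies it to a made-up `sample` DataFrame whose values are copied from the first two rows shown above.

```python
import pandas as pd

def split_coordinates(column):
  """Splits a "(lat, long)" text column into latitude and longitude float Series"""
  # Strip both parentheses at once, then split on the comma
  lat_long = column.str.strip('()').str.split(',', expand=True)

  # The float conversion ignores the space left after the comma
  return lat_long[0].astype(float), lat_long[1].astype(float)

# Small stand-in for the airbnb DataFrame
sample = pd.DataFrame({'coordinates': ['(40.68514, -73.95976)',
                                       '(40.70837, -73.95352)']})
sample['latitude'], sample['longitude'] = split_coordinates(sample['coordinates'])

print(sample[['latitude', 'longitude']])
print(sample.dtypes)
```

Checking `dtypes` confirms that the new columns are floats rather than text.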




---

<center><h1> Q&A 3</h1> </center>

---





**Task 2**: We need to collapse `room_type` into correct categories

In [None]:
# Check out current unique values
airbnb['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Shared room',
       '   Shared room      ', 'Private', 'home', 'PRIVATE ROOM'],
      dtype=object)

When this task is performed step by step, without a reusable function, it usually looks like this:

```
# Deal with capitalized values
airbnb['room_type'] = airbnb['room_type'].str.lower()

# Deal with leading and trailing spaces
airbnb['room_type'] = airbnb['room_type'].str.strip()
airbnb['room_type'].unique()

array(['private room', 'entire home/apt', 'private', 'shared room',
       'home'], dtype=object)

# Map values
mappings = {'private room': 'Private Room', 
            'private': 'Private Room',
            'entire home/apt': 'Entire place',
            'shared room': 'Shared room',
            'home': 'Entire place'}

# Replace values and collapse data
airbnb['room_type'] = airbnb['room_type'].replace(mappings)
airbnb['room_type'].unique()
```


In [None]:
# Create a new function that cleans and remaps categorical text columns
def remap_text_columns(column, mapping):
  """
  A function that cleans and remaps categorical text columns

  Arguments
  ----------
  column: A text column in a DataFrame
  mapping: A dictionary mapping each cleaned (lowercase, stripped) value to its final category

  Returns
  -------
  Updated column
  """

  # Deal with capitalized values and leading/trailing spaces
  updated_column = column.str.lower()
  updated_column = updated_column.str.strip()

  # Perform mapping and return
  updated_column = updated_column.replace(mapping)
  
  return updated_column

In [None]:
# Define the mappings for room_type
mappings = {'private room': 'Private Room', 
            'private': 'Private Room',
            'entire home/apt': 'Entire place',
            'shared room': 'Shared room',
            'home': 'Entire place'}

# Clean room_type column
airbnb['room_type'] = remap_text_columns(airbnb['room_type'], mappings)

# Check out unique values
airbnb['room_type'].unique()

array(['Entire place', 'Private Room', 'Shared room'], dtype=object)
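
One thing worth checking after a remap: `.replace()` leaves any value that is missing from the mapping untouched, so new dirty values can slip through silently. The sketch below restates `remap_text_columns()` in compact form and runs it on a made-up Series. The value `'hotel room'` and the `expected` set are hypothetical, added only to show the check.

```python
import pandas as pd

def remap_text_columns(column, mapping):
  """Lowercases, strips and remaps a categorical text column"""
  updated_column = column.str.lower().str.strip()
  return updated_column.replace(mapping)

mappings = {'private room': 'Private Room',
            'private': 'Private Room',
            'entire home/apt': 'Entire place',
            'shared room': 'Shared room',
            'home': 'Entire place'}

# Made-up values, including one ('hotel room') that the mapping does not cover
room_type = pd.Series(['PRIVATE ROOM', '   Shared room      ', 'home', 'hotel room'])
cleaned = remap_text_columns(room_type, mappings)

# Any value outside the expected categories was not remapped
expected = {'Private Room', 'Entire place', 'Shared room'}
unmapped = set(cleaned.unique()) - expected
print(unmapped)
```

An empty `unmapped` set means every value landed in one of the intended categories.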



---

<center><h1> Q&A 4</h1> </center>

---



