
## Example 1: Peer-to-Peer Lending (Finance)

### The Business Model

Peer-to-peer (P2P) lending occurs when investors lend money directly to individuals or businesses through an online service. The online service provider matches lenders with borrowers and conducts the analysis required to determine the interest rate charged to the borrower and the risk incurred by the lender. Peer-to-peer lending usually has lower operating costs, so investors tend to get higher returns and borrowers lower loan rates, although this is not always the case.

### Company: Lending Club

**Lending Club** is a peer-to-peer lending company based in the US. Lending Club matches people looking to invest money with people looking to borrow money. When investors invest their money through Lending Club, the money is passed on to borrowers, and when borrowers pay their loans back, the capital plus the interest passes back to the investors. These products are unsecured personal loans. To learn more about Lending Club, visit their [website](https://www.lendingclub.com/).

### The Dataset

The Lending Club dataset contains complete loan data for all loans issued from 2007 through 2015, including the current loan status (Current, Late, Fully Paid, etc.) and the latest payment information. Features include credit scores, number of finance inquiries, address information (zip code and state), and collections, among others. The collections feature indicates whether the customer has missed one or more payments and the team is trying to recover the money.

The dataset contains about 890 thousand observations and 75 variables. More detail on this dataset can be found on [Kaggle's website](https://www.kaggle.com/datasets/adarshsng/lending-club-loan-data-csv?resource=download).


### Download and save

To download the dataset:

- Go to the [Kaggle Website](https://www.kaggle.com/datasets/adarshsng/lending-club-loan-data-csv?resource=download)
- Scroll down and click on the file "loan.csv"
- Click the "Download" button at the top of the screen
- Unzip the file
- Keep the dataset name as "loan.csv"
- Save the file in the *datasets* directory and change the link in notebooks accordingly.

Alternative to downloading the dataset:

- You might use my provided Google Drive link

**Note the following:**
- You need to be logged in to Kaggle to download the dataset.
- You may need to accept the terms and conditions.
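
Because `loan.csv` is large, it can be convenient to confirm the download landed in the right place without loading every row. The following is a minimal sketch, assuming the file was saved as `datasets/loan.csv` as described in the steps above; the helper name `read_header` is hypothetical and only reads the first line of the file.

```python
import csv
from pathlib import Path


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


# Assumed location, following the download steps above.
loan_path = Path("datasets") / "loan.csv"

if loan_path.exists():
    columns = read_header(loan_path)
    # The text above mentions 75 variables; other versions of the file may differ.
    print(f"{loan_path}: {len(columns)} columns, first five: {columns[:5]}")
else:
    print(f"{loan_path} not found; check the download steps.")
```

If the reported column count is far from the 75 variables mentioned above, the file may be a different version of the dataset or may not have been unzipped correctly.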

## Example 2: Predicting Survival on the Titanic

### Business Problem
Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on a few attributes such as gender, age, and social status, we can make reasonably accurate predictions of which passengers survived. Some groups of people, such as women, children, and the upper class, were more likely to survive than others. We can therefore learn about society's priorities and privileges at the time.
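
As a quick worked example, the two figures quoted above give the overall survival rate, which is a useful baseline to keep in mind when judging any model built on this data. This is only arithmetic on the numbers in the paragraph, not a computation on the CSV file itself.

```python
# Figures quoted in the paragraph above.
on_board = 2224
died = 1502

survived = on_board - died
survival_rate = survived / on_board

print(f"Survivors: {survived}")                        # 722
print(f"Overall survival rate: {survival_rate:.1%}")   # 32.5%
print(f"Overall death rate: {died / on_board:.1%}")    # 67.5%
```

A model that predicted "did not survive" for everyone on board would therefore already be right about two thirds of the time, so a useful model needs to beat that figure.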

### Dataset

To download the dataset:

- You can find it as **titanic.csv** in the course's Git repo.
- Save the file in the *datasets* directory and change the link in notebooks accordingly.

## Example 3: House Prices - Advanced Regression Techniques

### History
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.



### Download and save

To download the dataset:

- Go to the [Kaggle Website](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data)
- Scroll down and click on the file "train.csv"
- Click the "Download" button at the top of the screen
- Unzip the file
- Rename the file to **houseprice.csv**
- Save the file in the *datasets* directory and change the link in notebooks accordingly.

Alternative to downloading the dataset:

- You might use my provided Google Drive link

**Note the following:**
- You need to be logged in to Kaggle to download the dataset.
- You may need to accept the terms and conditions.
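
Before opening the notebooks, it may help to confirm that all three files are where the notebooks expect them. The sketch below assumes the file names used in the download steps above (`loan.csv`, `titanic.csv`, `houseprice.csv`) and a `datasets` directory relative to the current working directory; the function name `missing_datasets` is hypothetical.

```python
from pathlib import Path

# File names taken from the download steps above.
EXPECTED_FILES = ("loan.csv", "titanic.csv", "houseprice.csv")


def missing_datasets(directory="datasets", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in expected if not (folder / name).is_file()]


missing = missing_datasets()
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All datasets found.")
```

If your notebooks live in a subdirectory, pass the matching relative path, for example `missing_datasets("../datasets")`.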

## Test your datasets

In [1]:
import pandas as pd

In [6]:
data = pd.read_csv('datasets/titanic.csv')
data.shape, data.head()

((1309, 14),
    pclass  survived                                             name     sex  \
 0       1         1                    Allen, Miss. Elisabeth Walton  female   
 1       1         1                   Allison, Master. Hudson Trevor    male   
 2       1         0                     Allison, Miss. Helen Loraine  female   
 3       1         0             Allison, Mr. Hudson Joshua Creighton    male   
 4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   
 
        age  sibsp  parch  ticket      fare cabin embarked boat   body  \
 0  29.0000      0      0   24160  211.3375    B5        S    2    NaN   
 1   0.9167      1      2  113781  151.5500   C22        S   11    NaN   
 2   2.0000      1      2  113781  151.5500   C22        S  NaN    NaN   
 3  30.0000      1      2  113781  151.5500   C22        S  NaN  135.0   
 4  25.0000      1      2  113781  151.5500   C22        S  NaN    NaN   
 
                          home.dest  
 0             

In [5]:
data = pd.read_csv('datasets/houseprice.csv')
data.head()

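| Unnamed: 0 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |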


In [4]:
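import numpy as np

def get_first_cabin(row):
    # Keep only the first whitespace-separated cabin; missing values stay NaN.
    try:
        return row.split()[0]
    except (AttributeError, IndexError):
        return np.nan

In [5]:
# `data` must hold the Titanic DataFrame here, not the house price data loaded above.
data['cabin'] = data['cabin'].apply(get_first_cabin)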

In [6]:
data.to_csv('../titanic.csv', index=False)
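
To see what the cabin clean-up does to individual values, here is a small plain-Python sketch of the same idea. It is a restatement for illustration, not the notebook's own function: the name `first_cabin` is hypothetical, `math.nan` stands in for `np.nan`, and the value `"C22 C26"` is an illustrative multi-cabin entry rather than a row taken from the output above.

```python
import math


def first_cabin(value):
    """Return the first whitespace-separated token, or NaN for non-strings."""
    if isinstance(value, str) and value.split():
        return value.split()[0]
    return math.nan


print(first_cabin("C22 C26"))  # C22
print(first_cabin("B5"))       # B5
print(first_cabin(math.nan))   # nan
```

Entries with a single cabin pass through unchanged, entries with several cabins are reduced to the first one, and missing values remain missing.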