<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200> 

# Assignment:

- Replicate the lesson code.

 - This means that if you haven't followed along already, type out the things that we did in class. Forcing your fingers to hit each key will help you internalize the syntax of what we're doing. Make sure you understand each line of code that you're writing, and google anything that you don't fully understand.
 - [Lambda Learning Method for DS - By Ryan Herr](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit?usp=sharing)
- Convert the `term` column from string to integer.
- Make a column named `loan_status_is_great`. It should contain the integer 1 if `loan_status` is "Current" or "Fully Paid". Otherwise it should contain the integer 0.
- Make `last_pymnt_d_month` and `last_pymnt_d_year` columns.
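
The three column tasks above can be sketched on a tiny stand-in DataFrame. This is only one possible approach, not the official solution. The sample values (for example `' 36 months'` and `'Dec-2018'`) are assumptions about how the LendingClub file formats these columns, so check them against `df.head()` before reusing the code.

```python
import pandas as pd

# Toy rows that imitate the assumed LendingClub formatting (not real data).
toy = pd.DataFrame({
    'term': [' 36 months', ' 60 months', ' 36 months'],
    'loan_status': ['Current', 'Late (31-120 days)', 'Fully Paid'],
    'last_pymnt_d': ['Dec-2018', 'Jan-2019', None],
})

# 'term': keep only the digits, then cast to integer.
toy['term'] = toy['term'].str.extract(r'(\d+)', expand=False).astype(int)

# 'loan_status_is_great': 1 for "Current" or "Fully Paid", otherwise 0.
toy['loan_status_is_great'] = (
    toy['loan_status'].isin(['Current', 'Fully Paid']).astype(int)
)

# 'last_pymnt_d': parse as a date, then pull out the month and the year.
# Missing dates become NaT, so the new columns hold NaN for those rows.
parsed = pd.to_datetime(toy['last_pymnt_d'], format='%b-%Y')
toy['last_pymnt_d_month'] = parsed.dt.month
toy['last_pymnt_d_year'] = parsed.dt.year

print(toy)
```

Because of the missing payment date, the month and year columns come back as floats here. Whether to fill or drop those rows is a separate decision.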

In [1]:
##### Begin Working Here #####
# Import necessary packages and set display options for Dataset.

import pandas as pd
import numpy as np

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 1461)

In [None]:
# Fetch LendingClub Data via BASH command
!wget 'https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip'

# Unzip dataset to CSV
!unzip LoanStats_2018Q4.csv.zip

--2020-09-05 02:07:55--  https://resources.lendingclub.com/LoanStats_2018Q4.csv.zip
Resolving resources.lendingclub.com (resources.lendingclub.com)... 54.148.13.215, 35.161.89.82, 54.244.115.45
Connecting to resources.lendingclub.com (resources.lendingclub.com)|54.148.13.215|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/zip]
Saving to: ‘LoanStats_2018Q4.csv.zip.1’

LoanStats_2018Q4.cs     [            <=>     ]  22.28M  2.23MB/s    in 10s     

2020-09-05 02:08:06 (2.21 MB/s) - ‘LoanStats_2018Q4.csv.zip.1’ saved [23360898]

Archive:  LoanStats_2018Q4.csv.zip
replace LoanStats_2018Q4.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: 

In [None]:
# Read in the unzipped CSV file to Pandas
df = pd.read_csv('LoanStats_2018Q4.csv')

# Examine the head to ensure proper formatting
df.head(10)

##### Looks like the first line is causing some issues with NaN values! Let's clean this up by skipping that first row with the URL!

In [None]:
# Pass skiprows=1 to skip the first line, which holds the URL
df = pd.read_csv('LoanStats_2018Q4.csv', skiprows=1)

df.head()

##### Looks better! Now we can take a look at these NaN values!

In [None]:
df.isnull().sum().sort_values()

##### Well, that's odd... a TON of columns have exactly 2 missing values. Let's take a look at the tail and see what's going on.

In [None]:
df.tail()

##### Looks like we have some extra values in the last two rows as well. Let's clean this up!

In [None]:
# Reload the dataframe so that the first line and the last two lines are skipped.
# We also pass engine='python' because the default C parser does not support skipfooter.
df = pd.read_csv('LoanStats_2018Q4.csv', header=1, skipfooter=2, engine='python')

df.tail(5)
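
To see what `header=1` and `skipfooter=2` do without the full download, here is a minimal sketch on an in-memory file. The file contents below are made up for illustration (one note line, a header row, two loans, two summary lines) and only mimic the shape of the real CSV.

```python
import io
import pandas as pd

# Hypothetical file contents: a note line, the header, two loans, two footer lines.
raw = (
    "Notes offered by Prospectus (https://example.com)\n"
    "loan_amnt,int_rate\n"
    "10000, 13.56%\n"
    "4000, 18.94%\n"
    "Total amount funded in policy code 1: 14000\n"
    "Total amount funded in policy code 2: 0\n"
)

# header=1 uses the second line as column names and ignores the line above it.
# skipfooter=2 drops the last two lines; it requires the Python parsing engine.
clean = pd.read_csv(io.StringIO(raw), header=1, skipfooter=2, engine='python')

print(clean.shape)
print(clean.columns.tolist())
```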

##### Now that we have our table formatted properly we can get to work! Let's start by dropping the "unnecessary" columns from our dataset, the ones that are filled with NaN values!

In [None]:
df = df.drop(['id', 'member_id', 'url', 'desc', 'settlement_status', 
              'settlement_amount', 'settlement_date', 'settlement_term', 
              'settlement_percentage', 'debt_settlement_flag_date'], axis=1)
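
Rather than typing the column names by hand, the empty columns could also be found programmatically. The sketch below assumes the goal is to drop columns that contain no data at all. The toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: two columns are completely empty, one is only partly empty.
toy = pd.DataFrame({
    'loan_amnt': [10000, 4000, 7500],
    'member_id': [np.nan, np.nan, np.nan],
    'desc': [np.nan, np.nan, np.nan],
    'emp_title': ['Teacher', np.nan, 'Nurse'],
})

# List the columns where every single value is missing.
empty_cols = toy.columns[toy.isnull().all()].tolist()
print(empty_cols)

# Drop those columns in one step; partly filled columns are kept.
trimmed = toy.dropna(axis=1, how='all')
print(trimmed.columns.tolist())
```

Columns that are mostly, but not entirely, empty survive `how='all'`, so they would still need an explicit `drop` or a `thresh` value.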

In [None]:
df.head()

##### Now we'll take a look at the 'int_rate' column and see if we can clean it up. It shows a '%' sign now which indicates that this is a string value.

In [None]:
# Examine 'int_rate'
df['int_rate'].head(10)

##### Looks like we were right! Let's drop any whitespace and the '%' from this column!

In [None]:
# First we will define a function to make our lives easier
def intrate_to_float(cell_contents):
  return float(cell_contents.strip().strip('%'))

# Now we will use .apply to call the function on our column and preview the result
cleanedint = df['int_rate'].apply(intrate_to_float)
cleanedint.head(10)

##### That looks the way we want it to! Let's overwrite the previous column with the cleaned values!

In [None]:
# Assign the cleaned values over the 'int_rate' column
df['int_rate'] = cleanedint

In [None]:
df.head()

##### Looks good! But let's check with dtype to be sure.

In [None]:
# Check the dtype of 'int_rate' to confirm it is float
df['int_rate'].dtype
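
As an alternative to `.apply` with a helper function, the same cleanup can be sketched with pandas' vectorized string methods. The sample strings below are assumptions about what the raw `int_rate` values look like.

```python
import pandas as pd

# Stand-in for the raw column: strings with leading spaces and a trailing '%'.
rates = pd.Series([' 13.56%', ' 18.94%', '  7.02%'], name='int_rate')

# Strip whitespace, remove the trailing percent sign, then cast to float.
as_float = rates.str.strip().str.rstrip('%').astype(float)

print(as_float.dtype)
```

Both approaches give the same floats. The `.str` version also passes missing values through untouched, which the helper function would not.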

# Stretch Goals

You can do more with the LendingClub or Instacart datasets.

LendingClub options:
- There's one other column in the dataframe with percent signs. Remove them and convert to floats. You'll need to handle missing values.
- Modify the `emp_title` column to replace titles with 'Other' if the title is not in the top 20. 
- Take initiative and work on your own ideas!
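
A rough sketch of the first two LendingClub options, on toy data. The column name `pct_col` is a placeholder, since the notebook does not say which column holds the other percent signs. `top_n` is set to 1 here only so that the tiny example shows the effect, while the assignment asks for 20.

```python
import numpy as np
import pandas as pd

# Toy frame; 'pct_col' is a hypothetical name for the other percent column.
toy = pd.DataFrame({
    'pct_col': ['45.3%', np.nan, '0%', '101.2%'],
    'emp_title': ['Teacher', 'Nurse', 'Teacher', np.nan],
})

# The .str methods leave NaN alone, so missing values survive the conversion.
toy['pct_col'] = toy['pct_col'].str.strip().str.rstrip('%').astype(float)

# Keep the most common titles and relabel everything else as 'Other'.
top_n = 1  # the assignment uses 20
top_titles = toy['emp_title'].value_counts().head(top_n).index
toy['emp_title'] = toy['emp_title'].where(
    toy['emp_title'].isin(top_titles), 'Other'
)

print(toy)
```

Note that missing titles also end up as 'Other' with this approach. Keeping them as NaN would need an extra condition.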

Instacart options:
- Read [Instacart Market Basket Analysis, Winner's Interview: 2nd place, Kazuki Onodera](http://blog.kaggle.com/2017/09/21/instacart-market-basket-analysis-winners-interview-2nd-place-kazuki-onodera/), especially the **Feature Engineering** section. (Can you choose one feature from his bulleted lists, and try to engineer it with pandas code?)
- Read and replicate parts of [Simple Exploration Notebook - Instacart](https://www.kaggle.com/sudalairajkumar/simple-exploration-notebook-instacart). (It's the Python Notebook with the most upvotes for this Kaggle competition.)
- Take initiative and work on your own ideas!

You can uncomment and run the cells below to re-download and extract the Instacart data.

In [None]:
# !wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

In [None]:
# !tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

In [None]:
# %cd instacart_2017_05_01