# Notebook Title

## Setup Python and R environment
You can ignore this section.

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [3]:
%%R

# My commonly used R imports

require('tidyverse')

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


Loading required package: tidyverse
1: package ‘readr’ was built under R version 4.2.3 
2: package ‘dplyr’ was built under R version 4.2.3 
3: package ‘stringr’ was built under R version 4.2.3 


## 👉 download your data

You can write code here to download your dataset. If you already have it, leave the URL in a comment and load the file into a pandas or R dataframe (or both).
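
If you do want the download step in code, one option is to fetch the file only when no local copy exists. The sketch below is illustrative: `DATA_URL` is a placeholder (the notebook does not say where the CSV came from) and `fetch_if_missing` is a hypothetical helper name.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Placeholder only: replace with the real source URL of your dataset.
DATA_URL = "https://example.com/ny_mortgage_data_2022.csv"
DATA_PATH = Path("ny_mortgage_data_2022.csv")


def fetch_if_missing(path: Path, url: str) -> Path:
    """Download `url` to `path` unless a local copy already exists."""
    if not path.exists():
        # Network call; only reached when the file is not on disk yet.
        urlretrieve(url, str(path))
    return path


# Usage (commented out because DATA_URL is a placeholder):
# fetch_if_missing(DATA_PATH, DATA_URL)
```

Skipping the download when the file is already present keeps reruns of the notebook fast and avoids hitting the data source repeatedly.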

In [4]:
%%R
# Load the ny_mortgage_data_2022.csv

mortgage_data <- read_csv('ny_mortgage_data_2022.csv')
mortgage_data

Rows: 548905 Columns: 99
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (27): lei, state_code, conforming_loan_limit, derived_loan_product_type,...
dbl (64): activity_year, derived_msa-md, county_code, census_tract, action_t...
lgl  (8): applicant_ethnicity-4, applicant_ethnicity-5, co-applicant_ethnici...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 548,905 × 99
   activity_year lei        `derived_msa-md` state_code county_code census_tract
           <dbl> <chr>                 <dbl> <chr>            <dbl>        <dbl>
 1          2022 5493000YN…            35614 NY               36005  36005045102
 2          2022 5493006O6…                0 NY                  NA           NA
 3          2022 5493006O6…                0 NY                  NA           NA
 4          2022 5493006O6…                0 NY       

One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
  dat <- vroom(...)
  problems(dat) 


In [5]:
%%R

# Keep only these columns: activity_year, census_tract, loan_type, loan_purpose, loan_amount, income, property_value, applicant_race-1, co-applicant_race-1, applicant_sex, co-applicant_sex, applicant_age, co-applicant_age

mortgage_data <- mortgage_data %>% select("activity_year", 
                                          "census_tract", 
                                          "loan_type", 
                                          "loan_purpose", 
                                          "loan_amount",
                                          "income",
                                          "property_value", 
                                          "applicant_race-1",
                                          "co-applicant_race-1",
                                          "applicant_sex",
                                          "co-applicant_sex",
                                          "applicant_age",
                                          "co-applicant_age")

mortgage_data

# A tibble: 548,905 × 13
   activity_year census_tract loan_type loan_purpose loan_amount income
           <dbl>        <dbl>     <dbl>        <dbl>       <dbl>  <dbl>
 1          2022  36005045102         1            1      125000     39
 2          2022           NA         1            1      215000     NA
 3          2022           NA         1            1      415000     NA
 4          2022           NA         1            1     1425000     NA
 5          2022           NA         1           32      885000     NA
 6          2022  36091061401         1            1       75000     NA
 7          2022  36003951301         1            1      145000     NA
 8          2022  36013037400         1            1       75000     NA
 9          2022  36075020400         1            1      125000     NA
10          2022  36001014803         1            1      165000     NA
# ℹ 548,895 more rows
# ℹ 7 more variables: property_value <chr>, `applicant_race-1` <dbl>,
#   `co-applicant_r
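
The column listing above shows `property_value` parsed as `<chr>`, not `<dbl>`, and `read_csv` reported parsing issues when the file was loaded. One possible cause is that the column mixes numbers with text tokens such as `Exempt`; this is an assumption, not something the notebook confirms. A minimal Python sketch of a tolerant conversion, using a hypothetical helper `to_number` and made-up sample values:

```python
import math
from typing import Optional


def to_number(raw: Optional[str]) -> Optional[float]:
    """Return `raw` as a finite float, or None when it is blank or not numeric."""
    if raw is None:
        return None
    text = raw.strip()
    if not text:
        return None
    try:
        value = float(text)
    except ValueError:
        return None  # e.g. "Exempt" or "NA"
    # float() accepts "nan" and "inf"; treat those as missing too.
    return value if math.isfinite(value) else None


# Made-up sample values, not rows from the dataset.
sample = ["125000", "Exempt", "", "NA", "415000.0"]
cleaned = [to_number(v) for v in sample]
print(cleaned)  # [125000.0, None, None, None, 415000.0]
```

Before converting, it is worth counting how many values become missing, so that a surprising share of non-numeric entries does not go unnoticed.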

In [7]:
%%R

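# Select applications involving Asian applicants and store them in mortgage_data_asian.
# Condition 1: applicant_race-1 is 2 or 21-27 (Asian and its subcategories).
# Condition 2: co-applicant_race-1 is 2 or 21-27, or 7 (not applicable) or 8 (no co-applicant).
# Because the two conditions are joined with |, codes 7 and 8 keep every
# application without a co-applicant, whatever the applicant's race.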

mortgage_data_asian <- mortgage_data %>% 
  filter(`applicant_race-1` %in% c(2, 21, 22, 23, 24, 25, 26, 27) |
         `co-applicant_race-1` %in% c(2, 21, 22, 23, 24, 25, 26, 27, 7, 8))
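
The filter above joins its two conditions with `|`, so a row is kept as soon as either one holds. The Python sketch below contrasts that with a stricter "both must hold" variant on a few made-up code pairs. The stricter variant is only an assumption about one possible intent, not the notebook's method, and the function names are hypothetical.

```python
ASIAN = {2, 21, 22, 23, 24, 25, 26, 27}
NO_CO_APPLICANT = {7, 8}  # HMDA codes: 7 = not applicable, 8 = no co-applicant


def matches_or(applicant: int, co_applicant: int) -> bool:
    """Mirror of the R filter: either condition is enough to keep the row."""
    return applicant in ASIAN or co_applicant in (ASIAN | NO_CO_APPLICANT)


def matches_and(applicant: int, co_applicant: int) -> bool:
    """Stricter variant (assumed intent): both conditions must hold."""
    return applicant in ASIAN and co_applicant in (ASIAN | NO_CO_APPLICANT)


# Made-up (applicant_race-1, co-applicant_race-1) pairs, not rows from the data.
pairs = [(22, 8), (5, 8), (5, 22), (3, 5)]
for applicant, co_applicant in pairs:
    print(applicant, co_applicant,
          matches_or(applicant, co_applicant),
          matches_and(applicant, co_applicant))
```

The pair `(5, 8)` is the telling case: an applicant coded 5 with no co-applicant passes the `|` version but not the stricter one. Which behavior is right depends on the question the analysis is asking.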

## 👉 convert addresses --> lat/long 

See the [census-examples](https://github.com/data4news/census-examples) repository for examples. If you need help, ask in the class Slack channel. Chances are someone else in the class is struggling with the same problem, so we might as well all learn together there.

## 👉 convert lat/long to census geography codes 

(like 'GEOID', 'STATE', 'COUNTY', 'TRACT', 'BLOCK', etc...)

As above, see the [census-examples](https://github.com/data4news/census-examples) repository for examples, or ask in the class Slack channel if you get stuck.
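
For tract-level data, the 11-digit GEOID is the 2-digit state code, the 3-digit county code, and the 6-digit tract code joined together. The sketch below splits one apart; `split_tract_geoid` is a hypothetical helper, and the zero-padding assumes a leading zero may have been dropped when the code was read as a number.

```python
from typing import Dict, Union


def split_tract_geoid(geoid: Union[int, float, str]) -> Dict[str, str]:
    """Split an 11-digit tract GEOID into STATE (2), COUNTY (3), TRACT (6)."""
    if isinstance(geoid, float):
        geoid = int(geoid)  # 36005045102.0 -> 36005045102; NaN raises ValueError
    # Restore a leading zero that numeric parsing may have dropped.
    text = str(geoid).strip().zfill(11)
    if len(text) != 11 or not text.isdigit():
        raise ValueError(f"not an 11-digit tract GEOID: {geoid!r}")
    return {
        "GEOID": text,
        "STATE": text[:2],
        "COUNTY": text[2:5],
        "TRACT": text[5:],
    }


# First census_tract value shown in the data preview above.
print(split_tract_geoid(36005045102))
# {'GEOID': '36005045102', 'STATE': '36', 'COUNTY': '005', 'TRACT': '045102'}
```

Keeping the pieces as strings preserves leading zeros such as the `005` county code, which matters when joining to Census tables.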

## 👉 Output Data

Output your dataframe containing your data and the Census connector codes (like tract, block, etc...).

(The dataset I use already includes census tract codes.)

In [8]:
%%R 

# Keep only home purchase loans (loan_purpose == 1)

mortgage_data_asian <- mortgage_data_asian %>% filter(loan_purpose == 1)

mortgage_data_asian

# A tibble: 175,125 × 13
   activity_year census_tract loan_type loan_purpose loan_amount income
           <dbl>        <dbl>     <dbl>        <dbl>       <dbl>  <dbl>
 1          2022           NA         1            1      215000     NA
 2          2022           NA         1            1      415000     NA
 3          2022           NA         1            1     1425000     NA
 4          2022  36091061401         1            1       75000     NA
 5          2022  36003951301         1            1      145000     NA
 6          2022  36013037400         1            1       75000     NA
 7          2022  36075020400         1            1      125000     NA
 8          2022  36001014803         1            1      165000     NA
 9          2022  36063024001         1            1      325000     NA
10          2022  36111952800         1            1      175000     NA
# ℹ 175,115 more rows
# ℹ 7 more variables: property_value <chr>, `applicant_race-1` <dbl>,
#   `co-applicant_r

In [9]:
%%R

# Save mortgage_data_asian to a CSV file

write_csv(mortgage_data_asian, 'mortgage_data_asian.csv')
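
Several rows in the previews above have `NA` in `census_tract`, and those rows cannot be matched to tract-level Census data. A small Python sketch for counting them in the saved file follows. It assumes missing values appear as `NA` or as empty cells, and `count_missing_tracts` is a hypothetical helper.

```python
import csv
from typing import Tuple


def count_missing_tracts(csv_path: str, column: str = "census_tract") -> Tuple[int, int]:
    """Return (rows_missing_tract, total_rows) for a CSV with a header row."""
    missing = 0
    total = 0
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            total += 1
            # Assumption: missing tracts are written as "NA" or left empty.
            if (row.get(column) or "").strip() in ("", "NA"):
                missing += 1
    return missing, total


# Usage (run after the export cell above has written the file):
# missing, total = count_missing_tracts("mortgage_data_asian.csv")
# print(f"{missing} of {total} rows have no census tract")
```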