In [2]:
# Load Libraries

library('gmodels')
library('dplyr')
library('tidyr')

In [5]:
# Load data

loans = read.csv('/Users/britfathi/Downloads/loans.csv')

# Part 1: Does the term of the loan influence loan status? If so, how? Running a Chi-Square test of independence

In [7]:
CrossTable(loans$loan_status, loans$term, chisq=TRUE, expected = TRUE, sresid = TRUE, format = "SPSS")


   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|             Row Percent |
|          Column Percent |
|           Total Percent |
|            Std Residual |
|-------------------------|

Total Observations in Table:  21957 

                  | loans$term 
loans$loan_status |  36 months  |  60 months  |  Row Total | 
------------------|------------|------------|------------|
      Charged Off |      2029  |      1253  |      3282  | 
                  |  2540.011  |   741.989  |            | 
                  |   102.808  |   351.936  |            | 
                  |    61.822% |    38.178% |    14.947% | 
                  |    11.940% |    25.242% |            | 
                  |     9.241% |     5.707% |            | 
                  |   -10.139  |    18.760  |            | 
------------------|------------|------------|------------|
          Current |         0  |       502  |       502  | 

#### The p-value and the expected values all meet the requirements to proceed. All standardized residuals are greater than 2 in absolute value, meaning every cell differs significantly from its expected count. For Charged Off, the 36-month count is far below expected, while the 60-month count is far above expected. For Current, the 36-month count is far below expected, while the 60-month count is far above expected. For Fully Paid, the 36-month count is far above expected, while the 60-month count is far below expected.

### 36-month loans look more likely to be paid back, while 60-month loans are more likely to be charged off (25.2% of 60-month loans were charged off versus 11.9% of 36-month loans).
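
As a rough check on how the table above is read, the standardized residual in each cell can be recomputed as (observed - expected) / sqrt(expected), and its square is that cell's chi-square contribution. The sketch below uses only the Charged Off row printed above; the variable names are illustrative and not part of the original analysis.

```r
# Charged Off row, copied from the CrossTable output above
observed_count <- c(`36 months` = 2029, `60 months` = 1253)
expected_count <- c(`36 months` = 2540.011, `60 months` = 741.989)

# Standardized residual: distance from the expected count,
# measured in units of sqrt(expected)
std_resid <- (observed_count - expected_count) / sqrt(expected_count)

# Each cell's chi-square contribution is the squared residual
contribution <- std_resid^2

round(std_resid, 3)
round(contribution, 3)
```

The results should agree with the Std Residual and Chi-square contribution rows of the printed table up to rounding, since the expected values entered here are already rounded to three decimals.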

# Part 2: How has the ability to own a home changed after 2009? McNemar

#### McNemar is used because we have two points in time and are looking for change between them. The dates need some data wrangling first.

In [8]:
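# Parse the date strings (month/day/year) into Date objects
loans$DateR <- as.Date(paste(loans$Date), "%m/%d/%Y")

# Split the ISO-formatted date into Year, Month and Day columns
loans1 <- separate(loans, DateR, c("Year", "Month", "Day"), sep="-")

# Recode the year: 0 = 2009 or earlier, 1 = 2011 or later
# Note: loans dated 2010 match neither condition and stay NA,
# so they are left out of the cross table below
loans1$YearR <- NA
loans1$YearR[loans1$Year <= 2009] <- 0
loans1$YearR[loans1$Year > 2010] <- 1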
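
To make the effect of the two conditions concrete, here is a minimal base R sketch on three made-up dates (they are hypothetical and not taken from loans.csv). It applies the same cutoffs as the cell above and shows that a 2010 date falls between them.

```r
# Hypothetical example dates in the same month/day/year format
dates <- c("03/15/2008", "07/01/2010", "11/30/2011")

# Extract the year as a number
year <- as.numeric(format(as.Date(dates, "%m/%d/%Y"), "%Y"))

# Same recoding rule as above: 0 for 2009 or earlier, 1 for after 2010
year_r <- rep(NA, length(year))
year_r[year <= 2009] <- 0
year_r[year > 2010] <- 1

# The 2010 date matches neither condition and remains NA
data.frame(dates, year, year_r)
```

If 2010 loans are meant to count as "after 2009", the second condition would need to be `>= 2010`; which behavior is intended is not stated in the analysis.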


In [9]:
CrossTable(loans1$YearR, loans1$home_ownership, chisq=TRUE, 
           mcnemar=TRUE, fisher = TRUE, expected = TRUE, sresid = TRUE, format = "SPSS")


   Cell Contents
|-------------------------|
|                   Count |
|         Expected Values |
| Chi-square contribution |
|             Row Percent |
|          Column Percent |
|           Total Percent |
|            Std Residual |
|-------------------------|

Total Observations in Table:  15498 

             | loans1$home_ownership 
loans1$YearR |      OWN  |     RENT  | Row Total | 
-------------|-----------|-----------|-----------|
           0 |      550  |     3408  |     3958  | 
             |  551.893  | 3406.107  |           | 
             |    0.006  |    0.001  |           | 
             |   13.896% |   86.104% |   25.539% | 
             |   25.451% |   25.553% |           | 
             |    3.549% |   21.990% |           | 
             |   -0.081  |    0.032  |           | 
-------------|-----------|-----------|-----------|
           1 |     1611  |     9929  |    11540  | 
             | 1609.107  | 9930.893  |           | 
             |    0.002  |    0

#### The p-value for McNemar's Chi-squared test is less than .05, so the test is significant. However, the row and column percents show that the split between renting and owning stayed roughly the same before and after 2009, with about 14% owning and about 86% renting in both periods.
### The rates of renting and owning before and after 2009 are similar.
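
One way to see how a significant McNemar result can sit alongside nearly identical row percents is to re-enter the four cell counts printed above and run the tests separately. This is only a sketch using base R's `mcnemar.test` and `chisq.test`; the object names are illustrative. McNemar's statistic depends only on the two off-diagonal cells (3408 and 1611), while the Pearson chi-square compares the row proportions.

```r
# Cell counts copied from the CrossTable output above (rows: YearR 0 / 1)
tab <- matrix(c(550, 1611, 3408, 9929), nrow = 2,
              dimnames = list(YearR = c("0", "1"),
                              home_ownership = c("OWN", "RENT")))

# Row percents: share of OWN vs RENT within each period
round(100 * prop.table(tab, margin = 1), 1)

# McNemar's test compares only the off-diagonal cells
mc <- mcnemar.test(tab)

# Pearson chi-square compares the row proportions instead
chi <- chisq.test(tab, correct = FALSE)

mc$p.value
chi$p.value
```

Because the two off-diagonal counts differ mainly as a result of the two periods having very different numbers of loans, the row percents are the more direct evidence for the conclusion above.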

# Part 3: Based on the news story, does it seem likely that the data for this hands on came from the larger population of America? (using a Goodness of Fit Chi-Square)

In [11]:
loans %>% group_by(loan_status) %>% summarize(count=n())

loan_status,count
<chr>,<int>
Charged Off,3282
Current,502
Fully Paid,18173


In [12]:
# Counts in the order Charged Off, Current, Fully Paid (from the summary above)
observed = c(3282, 502, 18173)

In [13]:
expected = c(0.1, 0.75, 0.15)

In [14]:
chisq.test(x = observed, p = expected)


	Chi-squared test for given probabilities

data:  observed
X-squared = 83238, df = 2, p-value < 2.2e-16


## The p-value shows that there is a significant difference between our sample and the news story population.
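
For readers who want to see where the statistic comes from, the goodness-of-fit calculation can be sketched by hand: each expected count is the sample size times the population proportion, and the statistic sums (observed - expected)^2 / expected over the categories. The counts below come from the loan_status summary above; the proportions are the ones passed to the test and are assumed to be in the same order (Charged Off, Current, Fully Paid). The variable names are illustrative.

```r
# Counts from the loan_status summary: Charged Off, Current, Fully Paid
obs <- c(3282, 502, 18173)

# Population proportions as entered for the test (assumed same order)
p <- c(0.1, 0.75, 0.15)

# Expected count per category if the sample followed those proportions
exp_count <- sum(obs) * p

# Goodness-of-fit statistic, with df = number of categories - 1
stat <- sum((obs - exp_count)^2 / exp_count)
p_value <- pchisq(stat, df = length(obs) - 1, lower.tail = FALSE)

exp_count
stat
p_value
```

Comparing `obs` with `exp_count` side by side also shows which categories drive the result: far fewer Current loans and far more Fully Paid loans than the stated proportions would predict.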