<a href="https://colab.research.google.com/github/aureliuszi/LPA2021/blob/master/StarterCode_pilgrimBankA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pilgrim Bank (A) Starter Code

### Purpose:
This Jupyter notebook will help you prepare the Pilgrim Bank (A) case for class discussion. Jupyter notebooks combine ordinary text with R code and its output (other languages, such as Python, are also supported). For example, you can use a Jupyter notebook to write a report that contains analyses, or to describe and document your analyses so others can understand what you've done.

### Jupyter Notebooks:
Jupyter Notebooks are organized into **cells**. There are two primary cell types in Jupyter: **code cells** and **markdown cells**.      

The text you're reading now is in a **markdown cell**. Markdown cells contain human language text and allow you to intersperse your R code with written commentary. To create a new markdown cell, go to the menu above, click the "+" button, and then use the nearby dropdown menu to switch from "Code" to "Markdown". After entering text in a markdown cell, you can click the "Run" button above to make your text appear nicely formatted and easy to read, just like this cell you're reading now.         

Below you'll also see slightly shaded cells with "In[ ]" to the left of them. These are **code cells**. The "In[ ]" before a code cell tells you this is code input. When you run a code cell, the code's corresponding output will appear directly below your code. You can run a cell by pressing the "Run" button in the menu above or by using a keyboard shortcut: "Shift + Enter" runs the cell and moves to the next one, while "Ctrl + Enter" runs the cell and stays on it. You can edit and run the code in a code cell as many times as you'd like, and the new output will simply replace the previous output below the code. This feature allows you to try different commands, models, or other analyses and immediately see the output as you go. To add a new code cell, select the "+" button above, which by default adds a code cell.

You can insert a new cell on any line using any of three methods: 
* click the "+" in the menu above
* click the Insert option in the menu above
* use the keyboard shortcut "Esc + A" or "Esc + B" to insert a cell above or below respectively

If you're interested in learning more about using Jupyter notebook shortcuts, click "Help" in the menu above and navigate to "Keyboard Shortcuts". For additional information, here is a great resource: https://towardsdatascience.com/jypyter-notebook-shortcuts-bf0101a98330

### Class Assignment

To prepare for our class discussion of the Pilgrim Bank (A) case, you should run the code provided in this notebook, prepare to discuss the output, and experiment with any additional analyses that you think will help to inform the issues in the case.

### Summary of Notebook:      

**Section 1: Load data**     
Load Pilgrim Bank (A) data into R.        

**Section 2: Summarize and describe the data**     
Explore the Pilgrim Bank (A) data by performing a series of summary and description commands.         

**Section 3: Visualize the data**    
Use histograms, boxplots and scatterplots to visualize key variables.     

**Section 4: Begin testing relationships between variables**     
Use correlations and t-tests to examine relationships between variables.        

**Section 5: Understanding profit**     
Use linear regression to test the drivers of customer profitability.     

**Section 6: Your own analyses**    
Use this section to try your own analyses to better answer the questions outlined in the Pilgrim Bank (A) case.

## Section 1: Load data

Load the Pilgrim Bank (A) data into R as a dataframe named "df_pilgrim".

In [None]:
df_pilgrim <- read.csv("LPA20_Class4_PilgrimBank_A_Data.csv") 

After running the cell above, a dataframe called df_pilgrim should be loaded in your R environment, with 31634 observations of 7 variables. To confirm that the data loaded correctly, display the first 10 rows of the dataframe using the 'head' command. It is good practice in Jupyter notebooks to do this whenever you load a new dataframe.

In [None]:
head(df_pilgrim, 10)

Another useful way to view the structure of the dataframe is with the 'str' command. Note that this output provides the variable type, for example integers and numeric variables (the latter have decimal places). 

In [None]:
str(df_pilgrim)
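
Beyond 'head' and 'str', it can also help to confirm the dimensions of the dataframe and to count missing values in each column. The following is a minimal sketch on a small made-up dataframe (df_toy and all of its values are invented for illustration and are not the case data); the same calls should work on df_pilgrim.

```r
# Toy dataframe with invented values; it only stands in for df_pilgrim
df_toy <- data.frame(
  profit99 = c(21, -6, -49, 130, 75),
  online99 = c(0L, 0L, 1L, 0L, 1L),
  age99    = c(2L, NA, 5L, 3L, NA)
)

# Number of rows (observations) and columns (variables)
dim(df_toy)

# Count of missing values (NA) in each column
n_missing <- colSums(is.na(df_toy))
n_missing
```

If a column contains missing values, functions such as mean return NA unless you pass na.rm = TRUE.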

## Section 2: Summarize and describe the data

Summarize the variables in the dataframe:

In [None]:
summary(df_pilgrim)

### Summarizing Profit

The Pilgrim Bank (A) case asks about customer profitability. Let's get a better understanding of the profit variable using some commands in base R.      

There are many simple commands to quickly describe a single variable in R. Examples include mean, min, max, sd, range, and quantile. To run a command on a single variable, first specify the command, then in parentheses specify the dataframe, then type "$" to connect the dataframe to a subsequent variable, and then type the variable name.        

In general, to reference a single variable in R:     
* dataframe$variable_name

Find the mean of the profit variable:

In [None]:
mean(df_pilgrim$profit99)   

Find the standard deviation of profit:

In [None]:
sd(df_pilgrim$profit99) 

Find the range of the profit variable:

In [None]:
range(df_pilgrim$profit99)

Find the quartiles of the profit variable:

In [None]:
quantile(df_pilgrim$profit99)
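
By default, quantile returns the minimum, the three quartiles, and the maximum. If you want other percentiles, you can pass them through the probs argument. A minimal sketch, assuming a made-up vector profit_toy (values invented for illustration) that contains one missing value:

```r
# Invented profit values, including one missing observation
profit_toy <- c(-50, -10, 0, 20, 40, 100, 300, NA)

# Without na.rm = TRUE the mean of a vector containing NA is NA
mean(profit_toy)
mean(profit_toy, na.rm = TRUE)

# 10th, 50th, and 90th percentiles, ignoring the missing value
pct <- quantile(profit_toy, probs = c(0.1, 0.5, 0.9), na.rm = TRUE)
pct
```

On the case data, the equivalent call would be quantile(df_pilgrim$profit99, probs = c(0.1, 0.5, 0.9)).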

## Section 3: Visualize the data
You can begin by examining the distribution of customers for two demographic variables, age and income.

Create a histogram to explore the frequencies of each age category, specifying the main heading for the output and the label for the x-axis:

In [None]:
hist(df_pilgrim$age99, main = "Histogram of Age Buckets", xlab = "Age") 

Recall that the age categories are defined as follows:  
* 1 = less than 15 years
* 2 = 15-24 years
* 3 = 25-34 years
* 4 = 35-44 years
* 5 = 45-54 years 
* 6 = 55-64 years 
* 7 = 65 years and older
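
If you would like these category labels to appear in tables or plots instead of the numeric codes, one option is to convert the codes to a factor. A sketch, assuming a hypothetical vector age_codes with invented values and label strings chosen here for illustration:

```r
# Invented age codes (1-7), including one missing value
age_codes <- c(1, 3, 3, 4, 7, NA, 2)

# Labels follow the category definitions listed above
age_labels <- c("<15", "15-24", "25-34", "35-44", "45-54", "55-64", "65+")
age_factor <- factor(age_codes, levels = 1:7, labels = age_labels)

# Frequency of each category; useNA = "ifany" also counts missing values
age_counts <- table(age_factor, useNA = "ifany")
age_counts
```

Passing age_counts to barplot() would then draw one bar per labeled category.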

Create a histogram of the income categories:

In [None]:
hist(df_pilgrim$income99, main = "Histogram of Income Buckets", xlab = "Income") 

Reminder of the income categories: 
* 1 = less than `$15,000`
* 2 = `$15,000-$19,999`
* 3 = `$20,000-$29,999`
* 4 = `$30,000-$39,999`
* 5 = `$40,000-$49,999`
* 6 = `$50,000-$74,999`
* 7 = `$75,000-$99,999`
* 8 = `$100,000-$124,999`
* 9 = `$125,000` and more

Create a histogram to display the frequency distribution of the profit99 observations:

In [None]:
hist(df_pilgrim$profit99, main = "Histogram of Profit for 1999", xlab = "Profit")

You can also adjust the number of breakpoints, which determines how many bars appear in the histogram:

In [None]:
hist(df_pilgrim$profit99, breaks = 20, main = "Histogram of Profit for 1999", xlab = "Profit")

A boxplot is another way to display the distribution of a variable. The box is bounded by the 25th and 75th percentiles (the first and third quartiles), with the median inside the box:

In [None]:
boxplot(df_pilgrim$profit99, horizontal = TRUE)
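
To see the numbers behind a boxplot, you can compute the quartiles and the interquartile range directly. The sketch below uses a made-up vector x (values invented for illustration) and the common 1.5 x IQR rule for flagging outliers. Note that R's boxplot draws the box at "hinges", which can differ slightly from the output of quantile on some datasets.

```r
# Invented values with one clearly extreme observation
x <- c(-50, -10, 0, 20, 40, 100, 300)

# Quartiles and interquartile range
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(x)

# Points beyond 1.5 * IQR from the box are flagged as outliers
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# The statistics that boxplot itself uses, without drawing the plot
boxplot.stats(x)$out
```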

You can start looking at relationships between two variables with a scatterplot:

In [None]:
plot(df_pilgrim$tenure99, df_pilgrim$profit99, xlab = "Tenure", ylab = "Profit")

You can make the observations appear smaller in the plot. In the command below, cex is optional and stands for "character expansion"; the default size is cex=1.

In [None]:
plot(df_pilgrim$tenure99, df_pilgrim$profit99, cex = .2, xlab = "Tenure", ylab = "Profit") 

## Section 4: Begin testing relationships between variables

### Examining the relationship between tenure99 and profit99:

Note that tenure99 is a continuous variable measured in years. To find the minimum and maximum of this variable, use the "range" command in R.

In [None]:
range(df_pilgrim$tenure99)

What is the correlation between tenure99 and profit99?

In [None]:
cor.test(df_pilgrim$tenure99, df_pilgrim$profit99)

### Correlation Matrix: Relationship between all Pilgrim Bank variables

Generate a correlation matrix with all variables in the dataframe:

In [None]:
cor(df_pilgrim)

The same correlation matrix, with values rounded to 3 digits:

In [None]:
round(cor(df_pilgrim), digits=3)
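
If any column contains missing values, cor returns NA for every pair that involves that column. One possible workaround is the use argument, which tells cor to work with the complete observations available for each pair of variables. A minimal sketch on a made-up dataframe (df_toy and its values are invented for illustration):

```r
# Toy dataframe with one missing value in age99
df_toy <- data.frame(
  profit99 = c(10, 20, 30, 40, 50),
  tenure99 = c(1, 2, 3, 4, 5),
  age99    = c(2, NA, 4, 3, 6)
)

# Default behavior: pairs involving age99 come back as NA
cor_default <- cor(df_toy)
round(cor_default, digits = 3)

# Use all complete pairs of observations for each pair of variables
cor_pairwise <- cor(df_toy, use = "pairwise.complete.obs")
round(cor_pairwise, digits = 3)
```

Keep in mind that with pairwise deletion each correlation may be based on a different subset of rows.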

### Summarize the data for the online and offline customers separately

How many customers are not online (online99 = 0) versus online (online99 = 1)?

In [None]:
table(df_pilgrim$online99)

What is the mean profit in each of these groups of customers?

In [None]:
mean(df_pilgrim$profit99[df_pilgrim$online99==0])
mean(df_pilgrim$profit99[df_pilgrim$online99==1])

Another way to get mean profit grouped by online99 (1 = online, 0 = not online):

In [None]:
by(df_pilgrim$profit99, df_pilgrim$online99, mean)

Is the difference in profit between these two groups, online and not online, statistically significant?

In [None]:
t.test(profit99 ~ online99, data = df_pilgrim, var.equal = TRUE)
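
The var.equal = TRUE argument requests the pooled-variance t-test, which assumes both groups have the same variance. If you leave it out, R runs Welch's t-test, which does not make that assumption. The sketch below compares the two on simulated data (df_toy, the seed, and all parameter values are invented for illustration and are not the case data):

```r
# Simulated data: two groups of 20 with different spreads
set.seed(1)
df_toy <- data.frame(
  online99 = rep(c(0, 1), each = 20),
  profit99 = c(rnorm(20, mean = 100, sd = 50),
               rnorm(20, mean = 120, sd = 80))
)

# Pooled-variance t-test (assumes equal variances)
pooled <- t.test(profit99 ~ online99, data = df_toy, var.equal = TRUE)

# Welch's t-test (R's default; no equal-variance assumption)
welch <- t.test(profit99 ~ online99, data = df_toy)

# Compare degrees of freedom and p-values
c(pooled = pooled$parameter[["df"]], welch = welch$parameter[["df"]])
c(pooled = pooled$p.value, welch = welch$p.value)
```

The result of t.test is a list, so individual pieces such as p.value, statistic, and conf.int can be extracted with "$".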

To get help within Jupyter for any function, put a "?" before the function name. A description of the function will appear in the bottom half of your Jupyter tab.

In [None]:
?t.test

## Section 5: Understanding profit

Run a linear model (also called a linear regression) to test whether online99 is a significant predictor of profit99, and store the results in a new object called "model_1". Once the model is stored, you can print it by typing model_1 in a code cell, or view a fuller summary by typing summary(model_1) in a code cell.

Run a linear regression model (using the 'lm' command) to test whether online99 predicts profit99:

In [None]:
model_1 <- lm(profit99 ~ online99, data = df_pilgrim)

To view the summarized output that is now stored in model_1:

In [None]:
summary(model_1)

How does the output of the regression above compare to the result of the previous t-test of whether profit99 differs by online99?
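
One way to explore this question on a small scale is to fit both on a tiny made-up dataset where the group means are easy to compute by hand. In the sketch below, df_toy and its values are invented for illustration; with a single 0/1 predictor, the regression slope is the difference in group means.

```r
# Toy data: three offline (0) and three online (1) customers
df_toy <- data.frame(
  online99 = c(0, 0, 0, 1, 1, 1),
  profit99 = c(10, 20, 30, 40, 60, 80)
)

# Regression of profit on the 0/1 indicator
model_toy <- lm(profit99 ~ online99, data = df_toy)
coef(model_toy)

# Group means: the intercept matches group 0, the slope matches the gap
group_means <- tapply(df_toy$profit99, df_toy$online99, mean)
diff_means <- group_means[["1"]] - group_means[["0"]]

# t statistic of the slope versus the pooled-variance t-test
t_lm <- coef(summary(model_toy))["online99", "t value"]
t_pooled <- unname(
  t.test(profit99 ~ online99, data = df_toy, var.equal = TRUE)$statistic
)
c(lm = t_lm, t.test = t_pooled)
```

The two t statistics have the same magnitude; the sign differs only because t.test subtracts the groups in the opposite order.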

To see the confidence intervals for the regression coefficients:

In [None]:
confint(model_1)

What happens when we add tenure99 to the linear model? 

In [None]:
model_2 <- lm(profit99 ~ online99 + tenure99, data = df_pilgrim)
summary(model_2)

## Section 6: Your own analyses         

* How do the results of model_2 compare to model_1?  What is your interpretation?  
* What should Pilgrim Bank do based on these analyses?
* What other models would you run to inform your recommendations?  

You can try other models simply by adding more variables to the linear model above, after or instead of tenure99 (e.g., + age99 + income99).
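
Because age99 and income99 are category codes rather than true quantities, you may also want to compare treating them as numeric with treating them as categorical. The sketch below does this on simulated data (df_toy, the seed, and all coefficients are invented for illustration); wrapping a variable in factor() makes lm estimate a separate effect for each category relative to the first.

```r
# Simulated data standing in for the case data
set.seed(42)
n <- 70
df_toy <- data.frame(
  online99 = rbinom(n, size = 1, prob = 0.3),
  tenure99 = runif(n, min = 0, max = 30),
  age99    = rep(1:7, length.out = n)   # every age code appears
)
df_toy$profit99 <- 50 + 5 * df_toy$tenure99 + rnorm(n, sd = 40)

# age99 as a single numeric predictor (one slope)
model_numeric <- lm(profit99 ~ online99 + tenure99 + age99, data = df_toy)

# age99 as a categorical predictor (one coefficient per non-baseline level)
model_factor <- lm(profit99 ~ online99 + tenure99 + factor(age99),
                   data = df_toy)

# Number of estimated coefficients in each model
length(coef(model_numeric))
length(coef(model_factor))
summary(model_factor)
```

Which specification is more appropriate for the case is a judgment call; the numeric version assumes each step up in category has the same effect on profit.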