# Module 7 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import scipy.stats as st
from nose.tools import assert_equal, assert_almost_equal, assert_is_instance, assert_is_not
%matplotlib inline

In the following problems we will use the Dow Jones Index data.

In [None]:
df = pd.read_csv('dow_jones_index.data')
df.head()

# Problem 1: Creating a Scatter Plot

Write a function called `scatter_plot` that takes a DataFrame, a stock name, and two column names from the Dow Jones Index data, and draws a scatter plot of the two columns for that stock.

For example, the function would be able to take in `'AA'`, `'open'`, and `'close'` as inputs, and it would plot the scatter plot of `'open'` and `'close'` for the `'AA'` stock.

Furthermore:

1. Label the x-axis with the column name passed as `x_data`.

2. Label the y-axis with the column name passed as `y_data`.
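
The filtering and labeling steps described above can be sketched on a small toy table. This is only an illustration under assumptions: the DataFrame `toy` and its values are made up and are not part of the assignment data, although the column names mirror the ones used in this problem.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data that mimics the layout of the Dow Jones file
toy = pd.DataFrame({
    "stock": ["AA", "AA", "BB", "AA", "BB"],
    "open":  [1.0, 2.0, 10.0, 3.0, 11.0],
    "close": [1.5, 2.5, 10.5, 3.5, 11.5],
})

# Boolean mask: keep only the rows that belong to one stock
subset = toy[toy["stock"] == "AA"]

# Draw the scatter plot on an explicit Axes and label both axes
fig, ax = plt.subplots()
ax.scatter(subset["open"], subset["close"])
ax.set_xlabel("open")
ax.set_ylabel("close")

# The plotted points can be read back from the Axes,
# which is how the test cell in this problem inspects a plot
points = ax.collections[0].get_offsets()
```

Working with an explicit `Axes` object (rather than the implicit `plt.scatter` state) makes it straightforward to return the plot from a function and to check it afterwards.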

In [None]:
# stock_name = "AA"

# # Filter the data for the given stock name
# stock_data = df[df['stock'] == stock_name]
# stock_data.head()

In [None]:
# x_data = 'open'
# y_data = 'close'

# # Extract the specified columns
# x = stock_data[x_data]
# y = stock_data[y_data]

In [None]:
# x.head(), y.head()

In [None]:
# Create the scatter plot
# plt.scatter(x, y)
# plt.xlabel(x_data)
# plt.ylabel(y_data)
# plt.title(f"Scatter Plot: {stock_name}")
# plt.show()

In [None]:
def scatter_plot(df,stock_name,x_data,y_data):
    """
    Inputs
    ------
    df: a pandas dataframe, the dataframe containing the relevant data
    
    stock_name: a string, the name of the stock
    
    x_data: a string, the name of the first column to be used
    
    y_data: a string, the name of the second column to be used
    
    Output
    ------
    
    ax: a matplotlib.axes object
    
    """
    
    ### YOUR CODE HERE
    
    # Filter the dataframe for the given stock name
    stock_data = df[df['stock'] == stock_name]

    # Extract the specified columns
    x = stock_data[x_data]
    y = stock_data[y_data]

    # Create the scatter plot
    fig, ax = plt.subplots()
    ax.scatter(x, y)
    ax.set_xlabel(x_data)
    ax.set_ylabel(y_data)
    #ax.set_title(f"Scatter Plot: {stock_name}")
    
    return ax

In [None]:
my_plot = scatter_plot(df, 'AA', 'open', 'close')

In [None]:
assert_equal(my_plot.get_xlabel(), 'open')
assert_equal(my_plot.get_ylabel(), 'close')
assert_almost_equal(my_plot.collections[0].get_offsets()[0][0], 15.82)
assert_equal(len(my_plot.collections[0].get_offsets()), 25)

In [None]:
df_mat = df.values  # convert pandas to numpy multi dim array
print(df_mat[:5])

In [None]:
df_mat

# Problem 2: Correlation of Columns

In this problem you will finish writing the `correlation` function. It has the following parameters: `df_mat`, a multidimensional array, and `col1` and `col2`, integer indices used to select columns of `df_mat`.

Your task is to do the following:
- Get `col1` and `col2` from `df_mat`.
- Plot `col1` and `col2` from `df_mat` using the `scatter` function from pyplot.
    - Your plot should have a title and labels for the x and y axis.
- Compute the Pearson and Spearman correlations of `col1` and `col2`.
- Lastly, your function should return:  
    - the `axes` object (we have created this for you)
    - Pearson correlation
    - Spearman correlation   
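
To see why the two coefficients are reported separately, here is a minimal sketch on synthetic data (the arrays `x` and `y` below are made up for illustration and are not assignment data). Pearson measures linear association, while Spearman measures monotonic association through ranks, so a strictly increasing but curved relationship gives a perfect Spearman coefficient and a Pearson coefficient below 1.

```python
import numpy as np
import scipy.stats as st

# Synthetic, strictly increasing but non-linear relationship (not assignment data)
x = np.arange(1.0, 11.0)
y = x ** 3

pearson_result = st.pearsonr(x, y)
spearman_result = st.spearmanr(x, y)

# Both results can be indexed like tuples: [0] is the coefficient, [1] the p-value
pearson_r = pearson_result[0]
spearman_rho = spearman_result[0]

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

Note that both SciPy functions return a coefficient together with a p-value, so code that needs only the coefficient has to pick out the first element.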

In [None]:
col1, col2 = 6, 7

In [None]:
#df_mat[:, col1][:5]

In [None]:
#df_mat[:, col2][:5]

In [None]:
# Get col1 and col2 from df_mat
x = df_mat[:, col1]
y = df_mat[:, col2]

In [None]:
# Plot col1 and col2 using scatter plot
fig, ax = plt.subplots()
ax.scatter(x, y)
ax.set_xlabel(f'Column {col1}')
ax.set_ylabel(f'Column {col2}')
ax.set_title('Scatter Plot')
plt.show()

In [None]:
# Compute Pearson and Spearman correlations

pearson_corr, _ = st.pearsonr(x, y)
spearman_corr, _ = st.spearmanr(x, y)

In [None]:
pearson_corr, spearman_corr

In [None]:
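# The p-value was discarded as `_` above, so recompute it before displaying it
pearson_pvalue = st.pearsonr(x, y)[1]
pearson_pvalue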

In [None]:
print(pearson_corr)

In [None]:
def correlation(df_mat, col1, col2):
    
    ### YOUR CODE HERE
    
    # Get col1 and col2 from df_mat
    x = df_mat[:, col1]
    y = df_mat[:, col2]
    
    
    # Plot col1 and col2 using scatter plot
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.scatter(x, y)
    ax.set_xlabel(f'Column {col1}')
    ax.set_ylabel(f'Column {col2}')
    ax.set_title('Scatter Plot')
    
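    # Compute Pearson and Spearman correlations.
    # Keep the full (coefficient, p-value) results, because the test cell
    # indexes the returned values as pc[0] and sc[0].
    pearson_corr = st.pearsonr(x, y)
    spearman_corr = st.spearmanr(x, y)
    
    # Return axes object, Pearson correlation, and Spearman correlation
    return ax, pearson_corr, spearman_corr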


In [None]:
from helper import c

col1, col2 = 6, 7
sol = c(df_mat, col1, col2)

ax, pc, sc = correlation(df_mat, col1, col2)
data = ax.collections[0].get_offsets()
print('Pearson Correlation: {0}'.format(pc[0]))
print('Spearman Correlation: {0}'.format(sc[0]))
assert_is_instance(ax, mpl.axes.Axes, msg='Return an Axes object.')  
assert_is_not(len(ax.title.get_text()), 0, msg="Your plot doesn't have a title.")
assert_is_not(ax.xaxis.get_label_text(), '', msg="Change the x-axis label to something more descriptive.")
assert_is_not(ax.yaxis.get_label_text(), '', msg="Change the y-axis label to something more descriptive.")

assert_equal(np.array_equal(data[:,0], sol[0]), True, msg="Data for the x-axis is not correct")
assert_equal(np.array_equal(data[:,1], sol[1]), True, msg="Data for the y-axis is not correct")

assert_almost_equal(pc[0], sol[2][0])
assert_almost_equal(sc[0], sol[3][0])

# Problem 3: Fitting OLS Model to Data

Your task is to finish writing the `reg_plot` function, which fits an OLS model to two columns of data. `reg_plot` takes in the following parameters:

- `df`, a **DataFrame** (not a NumPy multidimensional array),
- `x` and `y`, which are strings that each specify the name of a column in the DataFrame.

Use the `regplot` function from the Seaborn library to fit an OLS model to the data. Your plot should have a label on the x-axis and the y-axis, and also a title.
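
As background for what an OLS fit of one column on another computes, the following sketch uses plain NumPy on synthetic data. It is an illustration under assumptions, not Seaborn's internal implementation: the data, the seed, and all variable names are made up here. For a straight-line model, the least-squares slope equals cov(x, y) / var(x), and the intercept follows from the means.

```python
import numpy as np

# Synthetic data: a known line (slope 2, intercept 1) plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# Least-squares fit of a degree-1 polynomial: returns [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# The same line from the closed-form OLS formulas
slope_closed_form = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept_closed_form = y.mean() - slope_closed_form * x.mean()

print(f"polyfit:     slope={slope:.3f}, intercept={intercept:.3f}")
print(f"closed form: slope={slope_closed_form:.3f}, intercept={intercept_closed_form:.3f}")
```

The regression line drawn on a plot is this fitted line evaluated over the range of the x values.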

In [None]:
def reg_plot(df, x, y):
    '''
    df: dataframe
    
    x: column name
    
    y: column name
    '''
    
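    ### YOUR CODE HERE
    
    # Draw on a fresh Axes so the function does not rely on a global `ax`
    fig, ax = plt.subplots()
    
    # regplot draws the scatter points and the fitted OLS regression line
    sns.regplot(x=x, y=y, data=df, ax=ax)
    ax.set_xlabel(x)
    ax.set_ylabel(y)
    ax.set_title(f'OLS fit: {y} vs. {x}')
    
    return ax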

In [None]:
ax = reg_plot(df, x='open', y='close')
from helper import rp
sol_x, sol_y = rp(df)
assert_is_instance(ax, mpl.axes.Axes, msg='Return an Axes object.')  
assert_is_not(len(ax.title.get_text()), 0, msg="Your plot doesn't have a title.")
assert_is_not(ax.xaxis.get_label_text(), '', msg="Change the x-axis label to something more descriptive.")
assert_is_not(ax.yaxis.get_label_text(), '', msg="Change the y-axis label to something more descriptive.")
assert_equal(np.array_equal(ax.lines[0].get_ydata(), sol_y), True, msg="Data on y-axis is incorrect")
assert_equal(np.array_equal(ax.lines[0].get_xdata(), sol_x), True, msg="Data on x-axis is incorrect")
# Re-plot with the second pair of columns so `ax` matches the solution it is compared against
ax = reg_plot(df, x='high', y='volume')
sol_x, sol_y = rp(df, 'high', 'volume')
assert_is_instance(ax, mpl.axes.Axes, msg='Return an Axes object.')  
assert_is_not(len(ax.title.get_text()), 0, msg="Your plot doesn't have a title.")
assert_is_not(ax.xaxis.get_label_text(), '', msg="Change the x-axis label to something more descriptive.")
assert_is_not(ax.yaxis.get_label_text(), '', msg="Change the y-axis label to something more descriptive.")
assert_equal(np.array_equal(ax.lines[0].get_ydata(), sol_y), True, msg="Data on y-axis is incorrect")
assert_equal(np.array_equal(ax.lines[0].get_xdata(), sol_x), True, msg="Data on x-axis is incorrect")

**&copy; 2017: Robert J. Brunner at the University of Illinois.**

This notebook is released under the [Creative Commons license CC BY-NC-SA 4.0][ll]. Any reproduction, adaptation, distribution, dissemination or making available of this notebook for commercial use is not allowed unless authorized in writing by the copyright holder.

[ll]: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode 