# Module 10 - Organizing and Analyzing Data with NumPy and Pandas


**_Author: Jessica Cervi_**


**Expected time = 2.5 hours**

**Total points = 100 points**



    
## Assignment Overview

In this assignment, we will use two important packages for data science with Python: NumPy and Pandas. With NumPy, you will conduct some more advanced operations such as matrix multiplication and transposition. With Pandas, you will learn how to work with the Pandas datareader to create DataFrames from remote data.


This assignment is designed to build your familiarity and comfort coding in Python while also helping you review key topics from each module. As you progress through the assignment, answers will get increasingly complex. It is important that you adopt a data scientist's mindset when completing this assignment. **Remember to run your code from each cell before submitting your assignment.** Running your code beforehand will notify you of errors and give you a chance to fix your errors before submitting. You should view your Vocareum submission as if you are delivering a final project to your manager or client. 

***Vocareum Tips***
- Do not add arguments or options to functions unless you are specifically asked to. This will cause an error in Vocareum.
- Do not use a library unless you are explicitly asked to in the question.
- You can download the Grading Report after submitting the assignment. This will include feedback and hints on incorrect questions. 


### Learning Objectives

- Import data using Python
- Describe common Python packages
- Conduct simple statistical analysis on your data in Python


## Index: 

#### Module 10: Organizing and Analyzing Data with NumPy and Pandas

- [Question 1](#q1)
- [Question 2](#q2)
- [Question 3](#q3)
- [Question 4](#q4)
- [Question 5](#q5)
- [Question 6](#q6)
- [Question 7](#q7)
- [Question 8](#q8)
- [Question 9](#q9)
- [Question 10](#q10)
- [Question 11](#q11)
- [Question 12](#q12)
- [Question 13](#q13)
- [Question 14](#q14)
- [Question 15](#q15)
- [Question 16](#q16)
- [Question 17](#q17)
- [Question 18](#q18)
- [Question 19](#q19)

## Module 10: Organizing and Analyzing Data with NumPy and Pandas

We will begin to explore some specific applications using basic `NumPy` and `Pandas` functionality, to help us prepare for modeling work later in the course.  We will work with financial data using `NumPy` to perform basic operations and use `Pandas` for data manipulation. 

### Creating an `array()`

To begin, let's review the basics with `NumPy`.  This includes working with the `array` object, so let's begin by creating an array to represent the values from the data table below.

| RGSE | TAN | CMG | FLSR |
| ----- | ----- | ----- | ----- |
| 0.34 | 1.2 | 4.3 | 2.2 |
| 0.35 | 1.2 | 4.5 | 1.8 | 
| 0.38 | 1.2 | 4.8 | 1.7 |
| 0.35 | 1.1 | 4.9 | 1.2 | 

Remember that `NumPy` arrays support vectorized operations, including aggregations such as `.sum()` and `.mean()`, which can be applied to the whole array or along a single axis.
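
As a minimal sketch of these aggregations, the example below uses a small made-up array (`demo` is hypothetical and unrelated to the table above). It shows how the `axis` argument decides which dimension is collapsed.

```python
import numpy as np

# Hypothetical 2x3 array, used only for illustration
demo = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

total = demo.sum()              # adds all six entries
col_totals = demo.sum(axis=0)   # collapses the rows: one value per column
row_totals = demo.sum(axis=1)   # collapses the columns: one value per row
col_means = demo.mean(axis=0)   # mean of each column

print(total)        # 21.0
print(col_totals)   # [5. 7. 9.]
print(row_totals)   # [ 6. 15.]
print(col_means)    # [2.5 3.5 4.5]
```

A useful way to remember this: the axis you pass is the one that disappears from the result.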

Next, we import the `NumPy` library.

In [1]:
import numpy as np

[Back to top](#Index:) 
<a id='q1'></a>

### Question 1:

*5 points*
    
Using `np.array()`, create a `NumPy` array to match the one in the example above. Save your answer to `ans1` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans1 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q2'></a>

### Question 2:

*5 points*

Find the sum of the entries in the array `ans1` from Question 1. Assign the result to `ans2` as a float.
    

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans2 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q3'></a>

### Question 3:

*5 points*

Find the sum across the columns (one total per row) of the array `ans1` from Question 1 by setting the argument `axis=1`. Save your solution to `ans3` below.
    

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans3 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q4'></a>

### Question 4:

*5 points*

Find the sum down the rows (one total per column) of the array `ans1` from Question 1 by setting the argument `axis=0`. Save your solution to `ans4` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans4 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q5'></a>

### Question 5:

*5 points*

Using the logic from the previous questions, determine the mean of each column in the array `ans1` from Question 1 by using the `.mean()` method and the `axis` argument. Save your results to `ans5` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans5 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q6'></a>

### Question 6:

*5 points*

Using the logic from the previous questions, determine the standard deviation of each column in the array `ans1` from Question 1 by using the `.std()` method and the `axis` argument. Save your results to `ans6` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans6 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Array Indexing

Besides single integers and slices, `NumPy` accepts an array as an index. An index array must have an integer data type, and each of its values selects the element at that position in the array being indexed.
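
The sketch below illustrates basic indexing, slicing, and an integer index array on a small made-up array (`demo` is hypothetical, not the assignment data).

```python
import numpy as np

# Hypothetical 3x3 array, used only for illustration
demo = np.array([[10, 11, 12],
                 [20, 21, 22],
                 [30, 31, 32]])

first_row = demo[0]               # single integer index: the first row
last_col = demo[:, -1]            # slice every row, take the last column
picked = demo[np.array([2, 0])]   # integer index array: rows 2 and 0, in that order

print(first_row)   # [10 11 12]
print(last_col)    # [12 22 32]
print(picked)      # rows [30 31 32] and [10 11 12]
```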

[Back to top](#Index:) 
<a id='q7'></a>

### Question 7:

*5 points*

Slice the last row of the array `ans1` from Question 1 using the appropriate indices. Save your results to `ans7` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans7 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q8'></a>

### Question 8:

*5 points*

Multiply the array `ans1` from Question 1 by 1.03. Save the resulting array to `ans8` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans8 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q9'></a>

## Identity Matrix
An identity matrix is a square matrix whose main-diagonal elements are all equal to 1 and whose other elements are all 0.
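
For illustration, here is a 3-by-3 identity matrix (the size 3 is an arbitrary choice for this sketch, not the size the question asks for).

```python
import numpy as np

# np.identity(n) returns an n-by-n float array
eye3 = np.identity(3)

print(eye3)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]
```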

### Question 9:

*5 points*

Using the function `np.identity()`, build an identity matrix with the same dimensions as array `ans1` from Question 1. Save your matrix to `ans9` below.


In [None]:
### GRADED

### YOUR SOLUTION HERE
ans9 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q10'></a>

## Transpose

The transpose of a matrix is a new matrix whose rows are the columns of the original and whose columns are the rows of the original.
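
A short sketch on a made-up, non-square array (`demo` is hypothetical) makes the swap of rows and columns visible in the shape.

```python
import numpy as np

# Hypothetical 2x3 array, used only for illustration
demo = np.array([[1, 2, 3],
                 [4, 5, 6]])

flipped = demo.T        # .T is an attribute, so no parentheses

print(demo.shape)       # (2, 3)
print(flipped.shape)    # (3, 2)
print(flipped)
# [[1 4]
#  [2 5]
#  [3 6]]
```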

### Question 10:

*5 points*

Use the array `ans1` from Question 1 and find its transpose using the `.T` attribute. Save your transposed array to `ans10` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans10 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q11'></a>

## Dot product

Algebraically, the dot product is the sum of the products of the corresponding entries of two sequences of numbers (arrays). For two-dimensional arrays, `np.dot()` performs matrix multiplication.
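
The sketch below works through both cases with made-up values (`u`, `v`, and `m` are hypothetical): a dot product of two one-dimensional arrays, and a matrix multiplied by an identity matrix, which leaves the matrix unchanged.

```python
import numpy as np

# 1-D case: sum of element-wise products
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])
scalar = np.dot(u, v)          # 1*4 + 2*5 + 3*6

# 2-D case: matrix multiplication; multiplying by the identity returns m
m = np.array([[1.0, 2.0],
              [3.0, 4.0]])
same = np.dot(m, np.identity(2))

print(scalar)                    # 32.0
print(np.array_equal(same, m))   # True
```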

### Question 11:

*5 points*

Compute the dot product of the array `ans1` from Question 1 with the identity matrix of the same dimensions (from Question 9). Save the result to `ans11` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans11 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Pandas and Time Series

In this section, we use Pandas to examine a DataFrame of stock prices in a manner similar to that introduced in the lectures.

We begin by importing the `pandas` library.

In [None]:
import pandas as pd

Next, we read the dataset and display the first five rows.

In [None]:
stocks = pd.read_csv('data/tickers.csv', index_col=0)

In [None]:
stocks.head()

[Back to top](#Index:) 
<a id='q12'></a>

### Question 12:

*5 points*

Use the method `.describe()` to examine the descriptive statistics of the DataFrame `stocks`. Save your answer as a DataFrame to `ans12` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans12= None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q13'></a>

### Question 13:

*5 points*

Use the method `.memory_usage()` to examine the memory, in bytes, used by each column of the DataFrame `stocks`. Save the resulting Series to `ans13` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans13= None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q14'></a>

### Question 14:

*5 points*

Use the method `.pct_change()` to generate a new DataFrame with the percent change for each ticker. Save your answer as a DataFrame to `ans14` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans14= None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q15'></a>

### Question 15:

*5 points*


The default value of the `periods` argument in `.pct_change()` is 1, which for daily data gives the percentage change over one day. We want a broader window, so we need to change this to a weekly (5 trading days) percent change by entering an appropriate `periods` value.
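
To see what the `periods` argument does, here is a minimal sketch on a made-up price column (the name `aaa` and the values are hypothetical, and `periods=2` is chosen only to keep the example short).

```python
import pandas as pd

# Hypothetical prices that grow by 10% at every step
prices = pd.DataFrame({"aaa": [100.0, 110.0, 121.0, 133.1]})

one_step = prices.pct_change()            # compares each row with the previous row
two_step = prices.pct_change(periods=2)   # compares each row with the row two steps back

print(one_step)   # first row is NaN, then roughly 0.10 in every row
print(two_step)   # first two rows are NaN, then roughly 0.21
```

The first `periods` rows are always `NaN`, because there is no earlier row to compare against.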

Generate a dataframe representing the weekly percent change in price data. Save your results as a dataframe to `ans15` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans15= None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q16'></a>

### Question 16:

*5 points*

In Question 15, we changed the time window for `.pct_change()`. To build on that work, we can use the `.rolling()` method on a DataFrame to group the data into windows of a fixed number of rows. We then perform an aggregate operation on each window. For example, we can compute the rolling mean within a 20-day window. Note that this means 20 market days, not including weekends or holidays when the market is closed.
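
As a small sketch of the idea, the example below computes a rolling mean over a 3-row window on made-up data (the column name `aaa` and the window size are hypothetical, chosen to keep the output short).

```python
import pandas as pd

# Hypothetical prices, used only for illustration
prices = pd.DataFrame({"aaa": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Each output row is the mean of that row and the two rows before it
smooth = prices.rolling(window=3).mean()

print(smooth)   # rows 0 and 1 are NaN, then 2.0, 3.0, 4.0
```

The first `window - 1` rows are `NaN` because a full window is not yet available.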

Use the `.rolling()` method together with the `.mean()` method to generate a DataFrame with the rolling means for each ticker. Save your answer as a DataFrame to `ans16` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans16= None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q17'></a>

### Question 17:

*10 points*

Now, let's build on our percent change calculation to count how often each stock had a positive percent change. Write a function called `up_or_down` that takes in a DataFrame and returns a Series whose values are the count, by column, of the periods with a positive percent change.

For example, if the stock `tan` had 222 days of positive percent change, the returned Series might look like this:

```
cmg     111
tan     222
flsr    333
rgse    444
dtype: int64
```

Save _your function_ to `ans17`. Note that functions can be saved to variables in Python. Your variable will subsequently be called by a testing utility, using the DataFrame input, to verify the output.
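
The sketch below shows how a function can be stored in a variable and called through it. The helper `count_above_zero`, the variable `counter`, and the data are all hypothetical. The helper counts positive entries in a made-up DataFrame, not percent changes.

```python
import pandas as pd

def count_above_zero(frame):
    """Hypothetical helper: count the strictly positive entries in each column."""
    # (frame > 0) is a Boolean DataFrame; summing it counts the True values
    return (frame > 0).sum()

# No parentheses: this stores the function object itself, not its result
counter = count_above_zero

demo = pd.DataFrame({"aaa": [1.0, -2.0, 3.0],
                     "bbb": [-1.0, -1.0, 0.5]})

result = counter(demo)   # calling the function through the variable
print(result)            # a Series: aaa is 2, bbb is 1
```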

In [None]:
### GRADED

### YOUR SOLUTION HERE
def up_or_down(df):
    '''
    Take in a DataFrame and return a Series holding,
    for each column, the number of periods with a
    positive percent change.
    '''
    import pandas as pd
    return

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q18'></a>

### Question 18:

*5 points*

Examine the correlations between the percent changes of the ticker symbols by applying the `.corr()` method to the percent change of the DataFrame `stocks`. Save your results as a DataFrame to `ans18` below.

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans18= None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


[Back to top](#Index:) 
<a id='q19'></a>

### Question 19:

Using the assumption that correlated stocks are not good to have in a single portfolio, determine whether there are any symbol(s) you should drop if you own **tan**.

Using the `.corr()` DataFrame computed on percent change, determine whether any symbol's correlation with **tan** is above 0.5. Save the matching symbols as a set of strings to `ans19` below, e.g. `{'cmg', 'flsr'}`.
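
One way to filter a correlation matrix against a threshold is sketched below on made-up percent changes. The column names `aaa`, `bbb`, and `ccc` are hypothetical, and the 0.5 threshold mirrors the one in the question.

```python
import pandas as pd

# Hypothetical percent changes; bbb is exactly twice aaa, so the two move together
changes = pd.DataFrame({
    "aaa": [0.01, -0.02, 0.03, -0.01],
    "bbb": [0.02, -0.04, 0.06, -0.02],
    "ccc": [0.01, 0.01, -0.01, -0.01],
})

corr = changes.corr()                       # pairwise correlation matrix
with_aaa = corr["aaa"].drop("aaa")          # correlations with aaa, excluding itself
high = set(with_aaa[with_aaa > 0.5].index)  # symbols above the threshold

print(high)   # {'bbb'}
```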

In [None]:
### GRADED

### YOUR SOLUTION HERE
ans19 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Linear Regression using `statsmodels.api`

Next, let's try some regression using the `statsmodels.api` implementation of Linear Regression. To begin, we import the necessary library and will save our `X` and `y` variables.  

In [None]:
import statsmodels.api as sm

# Compute the percent changes once, then split them into features and target
returns = pd.read_csv('data/tickers.csv', index_col=0).pct_change().dropna()
X = returns.drop('tan', axis=1)
y = returns['tan']

Now, we prepare our input features by adding a constant column for the intercept term. We do this using the `sm.add_constant()` function.

In [None]:

X_const = sm.add_constant(X)

After adding our constant term to our feature matrix, we are ready to fit a model. With statsmodels, this is a two-step process. First, we instantiate an unfitted model with `sm.OLS()`, passing the target first and the feature matrix second.

In [None]:
model = sm.OLS(y, X_const)


Now that we've instantiated a model, we can fit it using the `.fit()` method. Once fit, you can see the summary of the model using the `.summary()` method of the variable that holds the fitted results.

```python
model = sm.OLS(y, X_const)  # Note that y comes first!
res = model.fit()
```

In [None]:
res = model.fit()



Finally, we can examine the confidence intervals for each of our coefficients. We are interested in whether or not 0 lies inside the interval. If so, we fail to reject the null hypothesis that the coefficient is zero at the corresponding significance level.


In [None]:
res.conf_int()