# Cogs 109: Modeling and Data Analysis
## Final project guidelines, 2019

Work in teams of at least 2 and no more than 4 students. Every student in the group will be expected to contribute substantially to the final product(s), and all students should be able to understand and explain all aspects of the project when you present your work in the final symposium.

Your project should:
- Identify a real problem, challenge or scientific question which could benefit from data analysis and modeling. Your final report must explain why the question is interesting or important. 
- Identify a relevant data set. You should learn about how the data was collected and be able to explain key features of the data, for example: How many observations? What are the noise sources? What are the relevant predictors?
- Identify at least one relevant data analysis approach, choosing from the methods covered in the course (linear or nonlinear regression, classification, clustering, PCA, etc.). Explain why this analysis approach is appropriate for addressing your question.
- Identify and explain one or more hypotheses or initial expectations that you will test using the data.
- Model selection: You should compare and contrast multiple different models (at least 2, but usually more). Your comparison should make use of cross-validation, bootstrap sampling, regularization, and/or other relevant techniques. For example, you might compare K-Nearest Neighbors classification for a range of k values (k=1,2,…,50), and select the k value that provides the lowest test set (cross-validation) error.
- Model estimation: Implement your data analysis and present the results using a combination of data visualizations (box plots, scatter plots), statistical analyses and models.
- Present your conclusions and outlook for next steps/future directions.
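
The model selection example in the list above (K-Nearest Neighbors over a range of k values, scored by cross-validation) could be sketched roughly as follows. This is a minimal illustration, not a required template: the synthetic data from `make_classification`, the 5-fold setting, and all variable names are assumptions chosen for the example, and you would substitute your own predictors and labels.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; replace with your own data set)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=0)

k_values = list(range(1, 51))
cv_error = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean 5-fold cross-validated accuracy, converted to an error rate
    acc = cross_val_score(knn, X, y, cv=5).mean()
    cv_error.append(1.0 - acc)

# select the k with the lowest cross-validation error
best_k = k_values[int(np.argmin(cv_error))]
print("best k:", best_k, "CV error:", min(cv_error))
```

Plotting `cv_error` against `k_values` is a natural companion figure for the model selection part of the report.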

The final product will be a written report, 5-10 pages in length. In addition, you will create a poster explaining your project to be presented in a symposium session on the last day of class. We will provide more information about the final paper and poster in a few weeks.


## Written report:
Your final report must include the following sections (use these headings).
- Introduction. 
    - Define the real problem and explain its motivation.
    - Identify the dataset you will use and explain its key characteristics.
    - Explain at least one hypothesis that you will test.
- Methods. Identify the data analysis approach you will use and explain the rationale/motivation for your choice of this approach.
- Results
    - Model selection. You MUST compare at least 2 models, using cross-validation, regularization, and/or other relevant techniques.
    - Model estimation. What are the final parameter estimates? What is the final accuracy of the model’s predictions?
- Conclusions and discussion. What can you conclude about your hypothesis? (Note that negative or ambiguous results are perfectly acceptable; you just need to explain what you found.) What are some potential implications/next steps for researchers interested in this topic?


In [None]:
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
from pytrends.request import TrendReq

In [None]:
companies = ['tesla', 'facebook', 'microsoft', 'amazon', 'google', 'uber', 'lyft', 'apple', 'snap']
key_terms = ['report', 'good', 'bad', 'up', 'down', 'stock']
company_symbol = ['TSLA']#, 'FB', 'MSFT', 'AMZN', 'GOOGL', 'UBER', 'LYFT', 'AAPL', 'SNAP']

In [None]:
#create kw_list 
kw_list = []
for c_name in companies:
    for k in key_terms:
        kw_list.append(c_name + " " + k)

In [None]:
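# Query Google Trends one keyword at a time (past 3 months, worldwide)
# and collect the interest-over-time series as columns of df
df = pd.DataFrame()
print(df.empty)
pytrends = TrendReq(hl='en-US', tz=360)
for kw in kw_list:
    pytrends.build_payload([kw], cat=0, timeframe='today 3-m', geo='', gprop='')
    df_temp = pytrends.interest_over_time()
    # drop the flag column that marks incomplete (partial) data points
    df_temp = df_temp.drop(['isPartial'], axis=1)
    print(df_temp.columns)
    if df.empty:
        df = df_temp
    else:
        df = df.join(df_temp)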

In [None]:
print(df.shape)

In [None]:
# get daily TSLA stock data (outputsize='compact' returns the latest 100 data points)
ts = 'TIME_SERIES_DAILY'
api_key = ''  # fill in your Alpha Vantage API key
outputsize = 'compact'
df_stocks = {}
for i, symbol in enumerate(company_symbol):
    link = 'https://www.alphavantage.co/query?function={}&symbol={}&apikey={}&outputsize={}'\
            .format(ts, symbol, api_key, outputsize)
    r = requests.get(link)
    data = json.loads(r.text)
    stock_data_per_day = json.dumps(data["Time Series (Daily)"])
    df_temp = pd.read_json(stock_data_per_day).transpose()
    df_temp.reset_index(level=0, inplace=True)
    df_temp.columns = ['times', 'open', 'high', 'low', 'close', 'volume']
    df_stocks[companies[i]] = df_temp
print(df_stocks['tesla'])

In [None]:
df.head()

In [None]:
df_stocks['tesla'].head()

In [None]:
#reverse df rows
df = df.iloc[::-1]
df.head()

In [None]:
tesla_names = [x for x in list(df.columns.values) if 'tesla' in x]
df_tesla_trends = df[tesla_names]

In [None]:
stock_times = df_stocks['tesla'].times
trends_times = list(df_tesla_trends.index)
joint_times = list(set(stock_times) & set(trends_times)) 

In [None]:
df_stocks['tesla'] = df_stocks['tesla'].loc[df_stocks['tesla']['times'].isin(joint_times)]
df_stocks['tesla'] = df_stocks['tesla'].reset_index()
print(df_stocks['tesla'].head())
df_tesla_trends = df_tesla_trends.loc[df_tesla_trends.index.isin(joint_times)]
df_tesla_trends = df_tesla_trends.reset_index()
df_tesla_trends.columns = ['_'.join(x.split()) for x in list(df_tesla_trends.columns) if len(x) > 1]
print(df_tesla_trends.head())

In [None]:
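# join on the positional index: this assumes both frames were filtered to the
# same dates and are sorted in the same date order
df_tesla = df_tesla_trends.join(df_stocks['tesla'])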
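
The positional join above only pairs the right rows if both frames happen to be in the same order. One possible alternative is to merge on the date columns directly. The sketch below uses small toy frames with made-up values; the column names `date` and `times` are assumptions that mirror the frames above.

```python
import pandas as pd

# Toy frames with hypothetical values, mimicking the shapes used above
trends = pd.DataFrame({
    'date': pd.to_datetime(['2019-05-01', '2019-05-02', '2019-05-03']),
    'tesla_report': [10, 20, 30],
})
stocks = pd.DataFrame({
    # deliberately in descending order to show that row order does not matter
    'times': pd.to_datetime(['2019-05-03', '2019-05-02', '2019-05-01']),
    'open': [3.0, 2.0, 1.0],
    'close': [3.5, 1.5, 1.0],
})

# an inner merge keeps only dates present in both frames and pairs rows by date
merged = trends.merge(stocks, left_on='date', right_on='times', how='inner')
merged = merged.sort_values('date').reset_index(drop=True)
print(merged)
```

Because the merge matches on the date values, it also makes the separate `joint_times` filtering step unnecessary.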

In [None]:
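# daily profit = close - open (positive when the price rose during the day)
df_tesla['profit'] = df_tesla['close']-df_tesla['open']
# reverse the row order again before splitting into train and test sets
df_tesla = df_tesla.iloc[::-1]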
df_tesla.head()

In [None]:
# Split into training and testing data
df_tesla_train = df_tesla[:50]
print(df_tesla_train.shape)
df_tesla_test = df_tesla[50:]
print(df_tesla_test.shape)

In [None]:
from sklearn.metrics import mean_squared_error
import statsmodels.api as sm
from sklearn.linear_model import Ridge
import numpy as np

In [None]:
mdl=sm.formula.ols(formula='profit ~ 1 + tesla_report + tesla_good + tesla_bad + tesla_up + tesla_down', data=df_tesla_train).fit()
mdl.summary()

In [None]:
mdl=sm.formula.ols(formula='close ~ 1 + tesla_report + tesla_good + tesla_bad + tesla_up + tesla_down', data=df_tesla_train).fit()
mdl.summary()

In [None]:
cols = ['tesla_report', 'tesla_good', 'tesla_bad', 'tesla_up', 'tesla_down']

In [None]:
# ridge regression on 'profit': sweep the regularization strength alpha
X = df_tesla_train[cols]
y = df_tesla_train['profit']
alpha = []
MSE_train = []
MSE_test = []
for i in range(90, 1000, 10):
    clf = Ridge(alpha=i)
    clf.fit(X, y)
    # compute each error once, then print and store it
    mse_train = mean_squared_error(y, clf.predict(X))
    mse_test = mean_squared_error(df_tesla_test['profit'], clf.predict(df_tesla_test[cols]))
    print("alpha: "+str(i))
    print("Training error = "+str(mse_train))
    print("Testing error = "+str(mse_test))
    print()
    alpha.append(i)
    MSE_train.append(mse_train)
    MSE_test.append(mse_test)
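
The sweep above scores every alpha on the same held-out rows that are later used to report the test error. One possible alternative, in line with the cross-validation requirement in the guidelines, is to choose alpha by cross-validation inside the training rows only. The sketch below uses scikit-learn's `TimeSeriesSplit`, so each fold validates on rows that come after its training rows. The synthetic data, the alpha grid, and the number of splits are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for the training rows (assumption: 50 days, 5 predictors)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.0, 0.2]) + rng.normal(scale=0.5, size=50)

alphas = [0.1, 1, 10, 100, 1000]
tscv = TimeSeriesSplit(n_splits=5)  # validation rows always follow training rows
cv_mse = []
for a in alphas:
    fold_mse = []
    for train_idx, val_idx in tscv.split(X):
        model = Ridge(alpha=a).fit(X[train_idx], y[train_idx])
        fold_mse.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    # average validation error across folds for this alpha
    cv_mse.append(np.mean(fold_mse))

best_alpha = alphas[int(np.argmin(cv_mse))]
print("best alpha:", best_alpha)
```

With alpha chosen this way, the held-out test rows are touched only once, to report the final error.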

In [None]:
# ridge regression on 'close' with a single, very strong penalty (alpha=20000)
X = df_tesla_train[cols]
y = df_tesla_train['close']
clf = Ridge(alpha=20000.0)
clf.fit(X, y) 
print("Training error = "+str(mean_squared_error(clf.predict(df_tesla_train[cols]), df_tesla_train['close'])))
print("Testing error = "+str(mean_squared_error(clf.predict(df_tesla_test[cols]), df_tesla_test['close'])))
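
A raw mean squared error is hard to interpret on its own, especially for a target like `close` that sits far from zero. One way to put it in context is to compare against a baseline that always predicts the training mean. The sketch below does this on synthetic data with scikit-learn's `DummyRegressor`; the data, the 50/10 split, and `alpha=1.0` are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (assumption: 60 days, 5 predictors, price-like target)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = 300 + X @ np.array([5.0, 0.0, -3.0, 0.0, 0.0]) + rng.normal(scale=2.0, size=60)
X_train, X_test, y_train, y_test = X[:50], X[50:], y[:50], y[50:]

# baseline: always predict the mean of the training target
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)

mse_baseline = mean_squared_error(y_test, baseline.predict(X_test))
mse_ridge = mean_squared_error(y_test, ridge.predict(X_test))
print("baseline MSE:", mse_baseline)
print("ridge MSE:", mse_ridge)
```

If the ridge model does not beat the mean-only baseline on the test rows, that is itself a reportable result about the hypothesis.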