# Homework - Big Data and Public Policy Class

The objective of this homework is to **put into practice the tools studied in class** by carrying out a first big data project step by step. We advise you to work on it in parallel with the lectures: the correspondence between the project steps and the lectures is indicated below.

You will build several **machine learning models that will predict continuous and categorical variables based on data coming from webscraping and/or an API**. 

We propose that you predict the **daily stock price data of a large company** (of your choice) **based on the company's tweets**. You can choose a company from one of the [major stock indices](https://www.wikiwand.com/en/List_of_stock_market_indices).

**Felix' Proposal:**
- predict stock prices of a selected company or aggregate stock index based on tweets related to COVID-19.
- Companies:
    - Constellation brands which includes Corona (as a gag)
    - tourism related companies
    - a major company producing respirator masks ("Atemschutzmasken") -> !

**MMM**

"Best known as the maker of Scotch tape and Post-It notes, 3M (MMM) also happens to be one of the largest producers of N95 respirators, the type of mask that more efficiently protects people against the virus than ordinary medical masks. Coronavirus has caused a worldwide mask shortage. N95 respirators and regular surgical masks have been unavailable on all major e-commerce platforms in the United States and China since early this year.
This would not be the first time that 3M benefited from a global health crisis. When the SARS epidemic hit in 2003, 3M's sales growth shot up amid increased demand for its respirators, according to the Melius report.
"The 2002 impact from SARS was highly beneficial to 3M," said Scott Davis, a co-author of the report, told CNN Business. While the company didn't disclose any specifics at the time, "it was meaningful and helped the stock to outperform in that period."
Basic medical masks provide a barrier from particulate matter, but do not seal tightly enough against the wearer's face to eliminate the risk of contracting the virus. Worn properly, the N95 mask can filter out about 95% of small airborne particles, according to Christiana Coyle, an expert in epidemics at New York University."
([See article](https://edition.cnn.com/2020/02/27/business/3m-coronavirus-hedge/index.html)).

TODO:
- fetch stock price data for 3M as y
- fetch tweets about coronavirus and respirators as x

In [1]:
# Imports
import os
import re
import time
import requests

import numpy as np
import pandas as pd
import seaborn as sns
import sklearn

import tweepy
from tweepy import OAuthHandler
from bs4 import BeautifulSoup
from IPython.display import display
from pprint import pprint

# Display options
pd.options.display.max_columns = 10
pd.options.display.max_rows = 50

Twitter API

In [2]:
# Credentials: read from environment variables rather than hard-coding secrets in the notebook
# (the variable names below are placeholders; set them in your shell before starting Jupyter)
api_key = os.environ["TWITTER_API_KEY"]  # API key
api_key_secret = os.environ["TWITTER_API_KEY_SECRET"]  # API secret key
access_token = os.environ["TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["TWITTER_ACCESS_TOKEN_SECRET"]

# Authentication
auth = OAuthHandler(api_key, api_key_secret)  # create an OAuthHandler instance
auth.set_access_token(access_token, access_token_secret)  # set the access tokens
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)  # create the API connection

# Test the authentication
try:
    api.verify_credentials()
    print("Authentication OK")
except tweepy.TweepError as err:
    print("Error during authentication: {}".format(err))

Authentication OK


---------------------------
Once you have decided on the setting that you will study (your own, or the proposed company and the corresponding time span), follow these steps:

**Step 1a: `X` variables** [week 2]
- Fetch the data using the **Twitter API** or any other API or website that you are interested in.
- Beware of the rate limits and organize your program so as to work within them if needed.
- The data should include some text, but might also have other interesting variables (retweets, favorites...).
- Create some (non-text based) `X_num` variables that you will use for the prediction

In [3]:
MMM_info = api.get_user(screen_name="3M")
pprint(MMM_info._json)

{'contributors_enabled': False,
 'created_at': 'Thu Sep 22 20:12:19 +0000 2011',
 'default_profile': False,
 'default_profile_image': False,
 'description': 'Here, we innovate with purpose & use #science every day to '
                'create real impact in every life around the world. '
                '#LifeWith3M',
 'entities': {'description': {'urls': []},
              'url': {'urls': [{'display_url': '3M.com',
                                'expanded_url': 'http://www.3M.com',
                                'indices': [0, 22],
                                'url': 'http://t.co/kRd2k9CCzD'}]}},
 'favourites_count': 19542,
 'follow_request_sent': False,
 'followers_count': 1422186,
 'following': False,
 'friends_count': 5006,
 'geo_enabled': True,
 'has_extended_profile': False,
 'id': 378197959,
 'id_str': '378197959',
 'is_translation_enabled': False,
 'is_translator': False,
 'lang': None,
 'listed_count': 1153,
 'location': 'St Paul, MN',
 'name': '3M',
 'notifications': Fal

In [4]:
# Fetch COVID-19-related tweets from the 3M Twitter account
target_user = "3M"
search_tags = ['coronavirus', 'Coronavirus', 'Covid_19']

# Start from an empty DataFrame; each matching tweet is added as one column
df_tweets = pd.DataFrame()

try:
    tweet_count = 0
    # Page through the whole timeline; tweet_mode='extended' returns the untruncated text
    for tweet in tweepy.Cursor(api.user_timeline, screen_name=target_user, tweet_mode='extended').items():
        # Keep only the hashtags of this tweet that are listed in search_tags
        tweet_hashtags = [hashtag['text'] for hashtag in tweet._json['entities']['hashtags']
                          if hashtag['text'] in search_tags]
        if len(tweet_hashtags) > 0:
            # 1. Transform the JSON into a one-column DataFrame labeled with the tweet counter
            df_tweet = pd.DataFrame.from_dict(tweet._json, orient='index', columns=[tweet_count])

            # 2. Prepend the tweet to df_tweets (the oldest tweet ends up first)
            df_tweets = pd.concat([df_tweet, df_tweets], axis=1)

            # 3. Count the matching tweets fetched so far
            tweet_count += 1

# RateLimitError is a subclass of TweepError, so it has to be caught first
except tweepy.RateLimitError:
    print("resource usage limit: {} skipped".format(target_user))

# TweepError arises, for example, when the user has protected tweets
except tweepy.TweepError:
    print("Failed to run the command on user {}, Skipping...".format(target_user))

# One row per tweet
df_tweets = df_tweets.transpose()
print(df_tweets)

                       created_at                   id               id_str  \
2  Fri Jan 31 20:36:25 +0000 2020  1223344303193427968  1223344303193427968   
1  Fri Feb 28 22:52:57 +0000 2020  1233525525446086658  1233525525446086658   
0  Mon Mar 02 23:38:05 +0000 2020  1234624044760227840  1234624044760227840   

                                           full_text truncated  ...  \
2  Due to #coronavirus, we are receiving an incre...     False  ...   
1  .@CNBC reporter @seemacnbc goes behind-the-sce...     False  ...   
0  We are grateful for the efforts of 3M employee...     False  ...   

  favorite_count favorited retweeted possibly_sensitive lang  
2            136     False     False              False   en  
1             58     False     False              False   en  
0             59     False     False              False   en  

[3 rows x 26 columns]


Inspect the scraped data set

In [9]:
# Try auto-inferring dtypes to cast columns from dtype "object" to more appropriate and memory-efficient dtypes
df_tweets = df_tweets.infer_objects()

# Manually deal with the remaining column dtypes
# TODO: deal with the remaining columns
# Guard against re-running this cell: once 'created_at' has become the index it is no longer a column,
# and selecting it again raises KeyError: 'created_at'
if 'created_at' in df_tweets.columns:
    df_tweets['created_at'] = pd.to_datetime(df_tweets['created_at'])

    # Set the datetime index
    df_tweets = df_tweets.set_index('created_at')

In [13]:
df_tweets.to_csv("./3M_covid19_tweets.csv")
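
The non-text `X_num` variables asked for in Step 1a can be derived from the numeric and structural fields of each tweet. The sketch below uses a toy stand-in for `df_tweets`: the column names follow the Twitter JSON, but the texts and the `retweet_count` values are illustrative, and the derived feature names (`text_length`, `n_hashtags`, `n_mentions`, `hour`) are only suggestions.

```python
import pandas as pd

# Toy stand-in for df_tweets (texts and retweet counts are illustrative, not real data)
toy_tweets = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2020-01-31 20:36:25", "2020-02-28 22:52:57", "2020-03-02 23:38:05"], utc=True
    ),
    "full_text": [
        "Due to #coronavirus, demand is rising",
        ".@CNBC reporter @seemacnbc visits a plant",
        "We are grateful for our employees",
    ],
    "favorite_count": [136, 58, 59],
    "retweet_count": [40, 12, 9],
})

# Index by timestamp, then add simple non-text features and drop the raw text
X_num = toy_tweets.set_index("created_at")
X_num = X_num.assign(
    text_length=X_num["full_text"].str.len(),        # number of characters
    n_hashtags=X_num["full_text"].str.count("#"),    # number of '#' symbols
    n_mentions=X_num["full_text"].str.count("@"),    # number of '@' symbols
    hour=X_num.index.hour,                           # hour of day (UTC) of the tweet
).drop(columns="full_text")

print(X_num)
```
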

**Step 1b: continuous `y` variables** [week 2]
- Fetching the data: 
    - if you work on the suggested idea, you can easily access daily stock prices using the [`yfinance` package](https://pypi.org/project/yfinance/)  (see below)
    - otherwise, you can find some interesting data listed in the syllabus

**Step 1c: merge `X_num` and `y`** [week 2]
- Beware of the temporality: in the case of the proposed study on stock market prices, you will have to deal with the fact that `X` is at the tweet level while `y` is daily.
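
A common way to handle the mismatch is to aggregate the tweet-level features to one row per day and then join them onto the daily prices. The sketch below uses toy data; the column names (`n_tweets`, `favorites`, `retweets`, `close`) and all values are illustrative assumptions.

```python
import pandas as pd

# Toy tweet-level features (illustrative values), indexed by tweet timestamp
tweets = pd.DataFrame(
    {"favorite_count": [136, 20, 58, 59], "retweet_count": [40, 5, 12, 9]},
    index=pd.to_datetime(
        ["2020-02-27 09:00", "2020-02-27 18:30", "2020-02-28 22:52", "2020-03-02 23:38"]
    ),
)

# Toy daily closing prices (illustrative values), one row per trading day
prices = pd.DataFrame(
    {"close": [150.0, 149.0, 152.0, 151.0]},
    index=pd.to_datetime(["2020-02-27", "2020-02-28", "2020-03-02", "2020-03-03"]),
)

# Collapse the tweets to one row per calendar day
daily = tweets.groupby(tweets.index.normalize()).agg(
    n_tweets=("favorite_count", "size"),
    favorites=("favorite_count", "sum"),
    retweets=("retweet_count", "sum"),
)

# Left-join onto the trading days; days without tweets get zeros
merged = prices.join(daily, how="left").fillna(0)
print(merged)
```

With a left join on the price index, tweets posted on non-trading days are dropped. Whether to drop them, or to assign them (and tweets posted after market close) to the next trading day, is a modeling choice worth stating in the notebook.
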

***Milestone 1 - March 24th*** *You can submit the previous steps as a first notebook.*

**Sample split.** Do the standard 80% / 20% training/test split using all days in the data. In addition, do a separate temporal split where the training set is the first 80% of days in the time series.
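
The two splits can be sketched as follows, assuming the rows of `X` and `y` are already sorted by date. The data here are random placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_days = 100
X = rng.normal(size=(n_days, 3))  # toy daily features, ordered by date
y = rng.normal(size=n_days)       # toy daily target

# 1. Standard random 80% / 20% split over all days
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Temporal split: the first 80% of days form the training set,
#    the remaining 20% form the test set
cutoff = int(n_days * 0.8)
X_train_t, X_test_t = X[:cutoff], X[cutoff:]
y_train_t, y_test_t = y[:cutoff], y[cutoff:]

print(X_train.shape, X_test.shape, X_train_t.shape, X_test_t.shape)
```
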

**For all machine learning models**, report performance measures on both the training and test samples.

**Step 2a: estimate different regression models using `X_num` and `y`** [week 3]
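
As an illustration of fitting several regression models and reporting performance on both samples, the sketch below compares three scikit-learn regressors on simulated data. The choice of models and metrics is an assumption; use the ones covered in class.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Simulated data standing in for X_num and y
rng = np.random.default_rng(0)
X_num = rng.normal(size=(200, 3))
y = 2.0 * X_num[:, 0] - X_num[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_num, y, test_size=0.2, random_state=0
)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and store train and test metrics side by side
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "train_r2": r2_score(y_train, pred_train),
        "test_r2": r2_score(y_test, pred_test),
        "train_mse": mean_squared_error(y_train, pred_train),
        "test_mse": mean_squared_error(y_test, pred_test),
    }

for name, metrics in results.items():
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

A large gap between the train and test metrics of a model is a sign of overfitting.
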

***Milestone 2 - March 31st***  *You can submit the previous steps as a second notebook.*

**Step 3: text analysis** [week 4]
- Featurize tweets (or another text dataset related to your subject): transform the text into a standard document-level dataset `X_doc`
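
One possible featurization is a TF-IDF document-term matrix. The sketch below assumes scikit-learn's `TfidfVectorizer`; the example texts and the `clean` helper are illustrative, not part of the course material.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (illustrative texts, one per tweet)
tweets = [
    "Due to #coronavirus, demand for respirators is rising https://example.com/a",
    "Our respirators protect health care workers",
    "Science improves lives around the world",
]


def clean(text):
    """Lowercase the text and strip URLs so they do not become features."""
    return re.sub(r"https?://\S+", "", text.lower()).strip()


# Each row of X_doc is one document, each column one term of the vocabulary
vectorizer = TfidfVectorizer(preprocessor=clean, stop_words="english")
X_doc = vectorizer.fit_transform(tweets)

print(X_doc.shape)
print(sorted(vectorizer.vocabulary_))
```

`X_doc` is a sparse matrix; most scikit-learn estimators accept it directly.
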

**Step 2b: estimate different regression models using `X_doc` and `y`** [week 3]

**Step 4: estimate classification models** [week 5]
- propose a categorical variable `y_calc` that you can compute from the continuous one (`y`) (e.g. positive or negative growth in stock prices). For the `X` dimension, you can use `X_doc` or `X_num` or both.
- you can use any other categorical variable that you find relevant 
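
The sketch below derives a binary `y_calc` (1 if the price rose relative to the previous day, 0 otherwise) from a simulated price series and fits a logistic regression with a temporal split. The simulated data and the choice of classifier are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Simulated daily features and prices (the first feature drives the daily return)
rng = np.random.default_rng(0)
n_days = 300
X_num = rng.normal(size=(n_days, 2))
returns = 0.01 * X_num[:, 0] + rng.normal(scale=0.005, size=n_days)
y = pd.Series(100 * np.cumprod(1 + returns))  # continuous target: price level

# Categorical target: 1 if the price rose relative to the previous day, else 0
y_calc = (y.pct_change() > 0).astype(int)

# The first day has no previous price, so drop it from both X and y
X_used = X_num[1:]
y_used = y_calc.iloc[1:].to_numpy()

# Temporal split: train on the first 80% of days, test on the rest
cutoff = int(len(y_used) * 0.8)
clf = LogisticRegression()
clf.fit(X_used[:cutoff], y_used[:cutoff])

train_acc = accuracy_score(y_used[:cutoff], clf.predict(X_used[:cutoff]))
test_acc = accuracy_score(y_used[cutoff:], clf.predict(X_used[cutoff:]))
print(round(train_acc, 3), round(test_acc, 3))
```
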

***Milestone 3 - April 21st***  *You can submit the previous steps as a third notebook.*

**Step 6: Dimension reduction** [week 6]
- Use one of the dimension reduction methods to reduce the dimensionality of the features
    - PCA or topic model (LDA or STM) or k-means clustering on the featurized text `X_doc`
- Run another classifier
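
As a sketch of this step, the pipeline below reduces a TF-IDF matrix to two components and runs a classifier on the reduced features. It uses `TruncatedSVD`, a PCA-like method that works directly on sparse matrices, as a stand-in for the methods listed above. The documents are randomly generated from two hypothetical word lists.

```python
import random

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Generate toy documents: class 1 uses "up" words, class 0 uses "down" words
rnd = random.Random(0)
up_words = ["demand", "growth", "record", "sales", "strong"]
down_words = ["shortage", "decline", "recall", "lawsuit", "weak"]
shared_words = ["masks", "company", "today"]

docs, labels = [], []
for label, words in [(1, up_words), (0, down_words)]:
    for _ in range(30):
        docs.append(" ".join(rnd.sample(words, 4) + rnd.choices(shared_words, k=2)))
        labels.append(label)

docs_train, docs_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=0
)

# Featurize, reduce to 2 dimensions, then classify on the reduced features
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
model.fit(docs_train, y_train)

test_accuracy = model.score(docs_test, y_test)
print(round(test_accuracy, 3))
```

Putting the reduction inside the pipeline ensures it is fitted on the training documents only, so no information from the test set leaks into the features.
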

**Bonus step: Econometric identification** [week 8]
- Find an exogenous shock affecting this firm (but not all the firms) and a control group of firms not affected
    - example: a natural disaster, a shock to the exchange rate, a change in ownership... that affects the functioning of this firm but not the other firms of the stock market index.
- scale up the previous data collection to the firms in the control group
- use one of the techniques studied in class to causally identify the impact of the exogenous shock on the stock price of the affected firm
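
One technique that fits this setup is difference-in-differences, shown here only as an example since the handout does not name a specific method. The sketch estimates it by OLS on a simulated panel of firms; all numbers, including the true effect of 2.0, are made up for illustration.

```python
import numpy as np

# Simulated panel: 20 firms observed for 60 days, shock on day 30
rng = np.random.default_rng(0)
n_firms, n_days = 20, 60
treated = (np.arange(n_firms) < 5).astype(float)  # first 5 firms are hit by the shock
post = (np.arange(n_days) >= 30).astype(float)    # 1 from the day of the shock onward
true_effect = 2.0

# Long format: one entry per (firm, day)
D = np.repeat(treated, n_days)
P = np.tile(post, n_firms)
noise = rng.normal(scale=0.5, size=n_firms * n_days)
y = 1.0 + 0.5 * D + 0.3 * P + true_effect * D * P + noise

# OLS of y on [1, treated, post, treated x post];
# the coefficient on the interaction is the difference-in-differences estimate
design = np.column_stack([np.ones_like(y), D, P, D * P])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
did_estimate = coef[3]
print(round(did_estimate, 3))
```

The estimate is only causal under the parallel-trends assumption: absent the shock, treated and control firms would have evolved in the same way.
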

***Milestone 4 - May 17th*** *You can submit the previous steps as a last notebook.*

---------------------------
**Requirements for the completion grade**: 
- the homework should be done in **groups of 1 or 2 persons**
- the homework must be submitted as (Python or R) **notebooks**
- you can submit them at the indicated **milestones** or on May 17th (no homework will be accepted after May 17th)
- the notebooks should contain at least **three graphs** (overall)
- each notebook should run from the beginning to the end, but:
    - beware of paths to folders that are on your computer but not on mine
    - if you have such a path, you can: 
        - have the data downloaded in the notebook directly
        - upload your data to a GitHub (or any other online storage) folder that the notebook can access
    - indicate if the notebook takes more than 30 minutes to run
- do not email us about the homework; ask your questions on the **forum**, where the answers will benefit everyone!

---------------------------

**Alternative homework**

If you prefer, a group of at most two persons can instead translate all Python notebooks of the class into R. This work would count as your homework for the class. Contact us in advance if you want to do this.

In [7]:
# Import the yfinance package.
# If you get a "module not found" error, run !pip install yfinance from your Jupyter notebook
import yfinance as yf
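
With `yfinance` installed, the daily prices for the continuous `y` variable can be fetched along the following lines. This is a minimal sketch: `fetch_daily_close` is a hypothetical helper name, the dates in the example call are placeholders, and the call itself is left commented out because it needs network access.

```python
def fetch_daily_close(ticker, start, end):
    """Return the daily closing prices of `ticker` as a pandas Series.

    Hypothetical helper; it needs the yfinance package and network access.
    """
    import yfinance as yf  # imported lazily so the function can be defined without the package

    history = yf.Ticker(ticker).history(start=start, end=end)
    return history["Close"].rename(ticker)


# Example call (placeholder dates, not executed here):
# y = fetch_daily_close("MMM", start="2020-01-01", end="2020-03-31")
```
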