# Final Project for INFO 6350
### Project Team: Yun Zhou (yz2685)

# Part 0: Background

**Statement of the problem:** The annual financial reports (10-K reports) of publicly traded companies are long and detailed: a single report can run up to 100 pages, or 30k - 60k words. Few people have the time and energy to read hundreds of such reports across companies and years, let alone make accurate predictions about the companies' financial performance from them.

**Impact:** The results of my analysis are intended to offer useful recommendations not only to NLP scholars but also to managers, executives, and investors. Text mining techniques can make business decision-making more efficient by sparing executives and managers the time needed to read their competitors' verbose 10-K reports. The results may also help individual investors spend less time analyzing 10-K reports and make better-informed investment decisions in the stock market.

# Part 1: Research Questions

### Q1: What are the tones or sentiments used to describe the risks of companies? Are they related to the stock prices? 

### Q2: Which features of the **risk sections** in the 10-K reports are most important for predicting or classifying the companies' stock prices?

### Q3: Given a risk section, can our classifier correctly determine whether the record is pre-2014 or post-2014?

# Part 2: Methodologies

**Methodology:** 
* Perform web crawling tasks to collect the financial reports over the years of the companies of interest. 
* Make gold labels based on the companies' financial metrics, such as earnings per share or net profit margin ratio, and label each report as below-average-margin vs. above-average-margin.
* Clean the corpus, remove tables, figures, stopwords, etc. 
* Perform sentiment analysis on the corpus to detect the positiveness and/or negativeness of the financial situations of the companies over the years. 
* Build regression models to predict the companies' financial performance and calculate R^2
* Build various classifiers (e.g. random forest, decision tree, SVM, logistic regression, BERT) to classify the companies into below-average and above-average based on the financial metrics. Calculate the F1 and accuracy scores.
* Compare the results from regression models and classifiers.
* Make recommendations on how to choose the most suitable model for such tasks in the future.
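
The evaluation step in the list above (accuracy and F1 for several classifiers) could be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the 10-K corpus; the helper name `evaluate_classifiers` and the model settings are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier


def evaluate_classifiers(X, y, cv=5):
    """Return mean cross-validated accuracy and F1 for a few classifiers.

    Hypothetical helper: the model list and settings are illustrative only.
    """
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=0),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    }
    results = {}
    for name, model in models.items():
        # cross_validate accepts several metrics at once
        scores = cross_validate(model, X, y, cv=cv, scoring=["accuracy", "f1"])
        results[name] = {
            "accuracy": scores["test_accuracy"].mean(),
            "f1": scores["test_f1"].mean(),
        }
    return results


# Synthetic stand-in for the text features and the binary gold labels
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
demo_results = evaluate_classifiers(X_demo, y_demo)
for name, metrics in demo_results.items():
    print(f"{name}: accuracy={metrics['accuracy']:.3f}, f1={metrics['f1']:.3f}")
```

Reporting F1 next to accuracy matters most when the two classes are unbalanced, which can happen when labels are split at the mean.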

# Part 3: Code

In [None]:
# import libraries
import requests
import urllib
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from fake_useragent import UserAgent
from selenium import webdriver
import time
import re
import pandas as pd
import json
import datetime
import numpy as np
import unicodedata
from collections import Counter
from nltk import word_tokenize, sent_tokenize
import matplotlib.pyplot as plt
import pickle
from   sklearn.feature_extraction.text import TfidfVectorizer
from   sklearn.feature_selection import SelectKBest, mutual_info_classif
from   sklearn.linear_model import LogisticRegression, LinearRegression
from   sklearn.model_selection import cross_val_score
from   sklearn.preprocessing import StandardScaler
import spacy
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from   sklearn.cluster import KMeans, SpectralClustering, DBSCAN, OPTICS, AgglomerativeClustering
import os
import string
import copy
from   collections import defaultdict
from   nltk.corpus import stopwords
from   sklearn.feature_selection import SelectKBest, mutual_info_regression
from wordcloud import WordCloud 

In [None]:
# define the base url needed to create the file url.
base_url = r"https://www.sec.gov"
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36'
}

# four-digit year pattern
yearPattern = re.compile(r'\d{4}$')

In [None]:
list_cik = pd.read_csv('nasdaq100_ticker_cik_mapping.csv').fillna(value = 0)
len(list_cik)

In [None]:
list_cik.head()

## I: Web Scraping

## 1. Scraping the SEC Query Page

In [None]:
# define lists to store the data scraped from the SEC website
ciks = []
risks = []
years = []
urls = []
companies = []
symbols = []

# base URL for the SEC EDGAR browser
endpoint = r"https://www.sec.gov/cgi-bin/browse-edgar"

In [None]:
driver = webdriver.Firefox()

### 1a. Helper functions 

In [None]:
# resolve the final URL of a filing index page (after any redirects)
def get10kPages(url):
    response = requests.get(url = url, headers=header)
    return response.url

In [None]:
def get10kLinks(url, list_of_10ks):
    response = requests.get(url = url, headers=header)
    soup = BeautifulSoup(response.content, 'html.parser')
    # for a in soup.find_all('a', href=True):
        # url = a['href']
        # print(url)
        
    suffix = "htm"

    table = soup.find('table')
    if table is None: # no table on the page, nothing to collect
        return
    rows = table.find_all('tr')
    if len(rows) > 1: # rows[1] must exist before it is indexed
        row10k = rows[1] # row 1 has link to 10k report
        for a in row10k.find_all('a', href=True):
            url = a['href']
            if url.endswith(suffix):
                list_of_10ks.append("https://www.sec.gov"+ url)

In [None]:
def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

In [None]:
# scrape the risk factor section (item 1a) of a 10k report
def scrap10k(url, cik, company, symbol):
    driver.get(url)

    time.sleep(2) # give browser some time to load the js 

    html = driver.page_source
    sp = BeautifulSoup(html, 'html.parser') # name the parser explicitly, as in the other helpers
    
    text = ""
    for d in sp.find_all(text=True):
        text += d.get_text()
    
    # cleaning 
    # print(text)
    text = text.replace(u'\xa0', u' ').lower()
                
    # extract risk factor sections only
    # print(url)
    
    # get fiscal year: take the last four characters of the first text node that mentions a month
    months = ("January", "February", "March", "April", "May", "June",
              "July", "August", "September", "October", "November", "December")
    yr = ""
    for span in sp.find_all(text=True):
        stext = span.text
        if any(month in stext for month in months):
            yr = stext.strip()[-4:]
            break
            
    
    # print("=================")    
    # print(text)
    
    # proceed only when a valid four-digit year was scraped
    yr_match = re.match(yearPattern, yr)
    if yr_match is not None:
        yr = int(yr) # convert string to int
        if yr > 2006:
            # use the second occurrence of each item heading
            start = find_nth(text, "item 1a.", 2)
            end = find_nth(text, "item 1b.", 2)
            # find_nth returns -1 when a heading is missing; skip those reports
            if start >= 0 and end > start:
                substring = text[start:end]
                if len(substring) > 100: # only pull longer risk factors
                    risks.append(substring)
                    years.append(yr)
                    ciks.append(cik)
                    urls.append(url)
                    companies.append(company)
                    symbols.append(symbol)
    #     else:
    #         print("year before 2007: ", yr)
    # else:
    #     print("fiscal year not detected")

In [None]:
def getDataByCIK(cik, company, symbol):
    ########################################
    ### Step 1. Scraping the SEC Query Page
    ########################################
    # define our parameters dictionary
    param_dict = {'action':'getcompany',
                  'CIK': cik,
                  'type':'10-k',
                  'dateb':'20230101',
                  'owner':'exclude',
                  'start':'',
                  'output':'',
                  'count':'100'}

    # request the url, and then parse the response.
    response = requests.get(url = endpoint, params = param_dict, headers=header)
    # response = requests.get(url = endpoint, params = param_dict)
    soup = BeautifulSoup(response.content, 'html.parser')

    # print('Request Successful')
    # print(response.url)
    
    data = soup.find_all(class_='blueRow')

    list_10k = []

    # collect every archive link found in the result rows
    for row in data:
        for a in row.find_all('a', href=True):
            url = a['href']
            if url.startswith('/Archives/edgar/'):
                list_10k.append("https://www.sec.gov"+ url)
                
    ########################################
    ### Step 2. Scraping Company Page 
    ########################################
    
    list_of_links = []
    for link in list_10k:
        list_of_links.append(get10kPages(link))
        
    # get the url for 10k report 
    list_of_10ks = []
    for url in list_of_links:
        get10kLinks(url, list_of_10ks)
        

    ########################################
    ### Step 3. Scraping 10k Reports 
    ########################################
    for report in list_of_10ks:
        scrap10k(report, cik, company, symbol)

In [None]:
# test
# scrap10k('https://www.sec.gov/Archives/edgar/data/1018724/000101872419000004/amzn-20181231x10k.htm', "1018724", "amzn")

In [None]:
# test
# getDataByCIK("1067983", "brk-b")

### 1b. Run scripts for all the companies of interest

In [None]:
%%time
for i, row in list_cik.iterrows():
    getDataByCIK(row.cik, row.company, row.symbol)

### 1c. Create Data Frame

In [None]:
# create a data frame
data_tuples = list(zip(ciks, symbols, companies, years, risks, urls))
# data_tuples

df = pd.DataFrame(data_tuples, columns=['cik', 'symbol', 'company', 'fiscal_year', 'risk', 'url'])

In [None]:
df.head()

In [None]:
len(df)

## II. Scrape stock prices from Yahoo Finance

In [None]:
dict_fi = {}

In [None]:
def getStockPricesByTicker(ticker, fiscal_year, period1, period2):
    url_history = f'https://finance.yahoo.com/quote/{ticker}/history?period1={period1}&period2={period2}&interval=1mo&filter=history&frequency=1mo&includeAdjustedClose=false'
    
    respf = requests.get(url_history, headers=header)
    # print(respf)
    
    soupf = BeautifulSoup(respf.text, 'html.parser')
    
    patternf = re.compile(r'\s--\sData\s--\s')
    script_data = soupf.find('script', text=patternf).contents[0]
    
    start = script_data.find("context")-2
    json_data = json.loads(script_data[start:-12])
    
    # get historical stock prices 
    try:
        HistoricalPriceStore = json_data['context']['dispatcher']['stores']['HistoricalPriceStore']
        # print(HistoricalPriceStore)
        # clean the stock price data 
        close = -99999 #default
        for row in HistoricalPriceStore['prices']:
            date = row['date']
            dt_formatted = datetime.datetime.fromtimestamp(date).date() # using the local timezone

            # get the stock price in December of a given year 
            if dt_formatted.month == 12 and dt_formatted.year == fiscal_year and 'close' in row:
                # debug
                # print(ticker, "=========", fiscal_year, "=====", row['date'])
                # print(HistoricalPriceStore)
                close = row['close']
                break # once find a December record then stop looking

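        # store the price (or the -99999 default when no December record was found)
        dict_fi.setdefault(ticker, {})[fiscal_year] = close
    except Exception:
        # skip tickers whose page lacks the expected price data
        return None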

### 2b. Run scripts for all the records collected from #1

In [None]:
# %%time

# # test
# for i, row in df.iterrows():
#     if row.symbol == 'AVGO':
#         ticker = row.symbol
#         fiscal_year = row.fiscal_year
#         # period1 = '1167609600' # 2007-01-01
#         # period2 = '1672444800' # 2022-12-30
#         getStockPricesByTicker('AVGO', fiscal_year, '1167609600', '1672444800')

In [None]:
# df[df['symbol'] == 'AVGO']

In [None]:
%%time
for i, row in df.iterrows():
    ticker = row.symbol
    fiscal_year = row.fiscal_year
    # period1 = '1167609600' # 2007-01-01
    # period2 = '1672444800' # 2022-12-30
    getStockPricesByTicker(ticker, fiscal_year, '1167609600', '1672444800')

### 2c. Create data frames for stock prices

In [None]:
len(dict_fi)

In [None]:
df_fi = pd.DataFrame.from_records(
    [
        (level1, level2, leaf)
        for level1, level2_dict in dict_fi.items()
        for level2, leaf in level2_dict.items()
    ],
    columns=['symbol', 'fiscal_year', 'price']
)

In [None]:
df_fi.head()

### 2d. Merge two data frames

In [None]:
merged_df = pd.merge(df, df_fi, how='left', on=['fiscal_year','symbol'])

In [None]:
merged_df

## III: Exploratory Analysis & Data Cleaning

### 0. Helper functions

In [None]:
def word_count(text):
    # number of NLTK word tokens in the text
    return len(word_tokenize(text))

### 1. Exclude invalid stock prices 

In [None]:
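# drop rows with missing prices or the -99999 default; copy so later column assignments are safe
merged_df = merged_df[merged_df['price'] >= 0].copy()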

### 2. Clean the text

In [None]:
text_list = []
text_len = []
for text in merged_df['risk']:
    text = unicodedata.normalize('NFKD', text.replace(r"\ in\ form", " inform").replace("\n", " ").lower().strip())
    text_list.append(text)
    cnt = word_count(text)
    text_len.append(cnt)

### 3. Add more metadata

In [None]:
# add the cleaned text as a column to the df
merged_df['text'] = text_list

In [None]:
merged_df['word_count'] = text_len

In [None]:
merged_df.describe()

In [None]:
merged_df.head()

In [None]:
plt.subplots(figsize=(12,8))
plt.hist(merged_df.fiscal_year, bins=merged_df.fiscal_year.nunique())
plt.title("Histogram of fiscal year")
plt.show()

In [None]:
plt.subplots(figsize=(12,8))
plt.hist(merged_df.price, bins=merged_df.price.nunique())
plt.title("Histogram of stock price")
plt.show()

#### Most of the stock prices appear to fall at or below USD 250.

In [None]:
# depicting the visualization
plt.scatter(merged_df.word_count, merged_df.price) 
plt.xlabel('word count') 
plt.ylabel('stock price') 
plt.title("Scatter plot of word count vs. stock price")
plt.show() 

#### The scatter plot above shows no obvious correlation between word count and stock price.

#### I pulled the records with stock prices above USD 750 as a sanity check. They appear to be accurate.
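
The visual impression can be backed up with a correlation coefficient. The sketch below uses a small made-up data frame (`toy`) in place of `merged_df`, so the numbers say nothing about the real data; Spearman's rho is computed as the Pearson correlation of the ranks.

```python
import pandas as pd

# Toy stand-in for the notebook's `merged_df`; the values are made up
toy = pd.DataFrame({
    "word_count": [1000, 2000, 3000, 4000, 5000],
    "price": [50.0, 20.0, 80.0, 10.0, 60.0],
})

# Pearson measures linear association
pearson = toy["word_count"].corr(toy["price"])

# Spearman measures monotonic association: Pearson on the ranks
spearman = toy["word_count"].rank().corr(toy["price"].rank())

print(f"Pearson r = {pearson:.3f}, Spearman rho = {spearman:.3f}")
```

Values close to zero for both coefficients would be consistent with the absence of a visible pattern in the scatter plot.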

In [None]:
merged_df[merged_df.price > 750]

#### Next, I removed duplicate records from the data frame based on cik and fiscal_year, and removed outliers with stock prices greater than USD 250.

In [None]:
data = merged_df.drop_duplicates(subset=['cik', 'fiscal_year'])

In [None]:
data = data[data.price <= 250].copy() # copy so that new columns can be added later without warnings

In [None]:
len(data)

In [None]:
# depicting the visualization after dropping duplicates and outliers
plt.scatter(data.word_count, data.price) 
plt.xlabel('word count') 
plt.ylabel('stock price') 
plt.title("Scatter plot of word count vs. stock price")
plt.show() 

In [None]:
# depicting the visualization
plt.scatter(data.fiscal_year, data.word_count, color="red") 
plt.xlabel('fiscal year') 
plt.ylabel('word count') 
plt.title("Scatter plot of word count vs. fiscal year")
plt.show() 

#### The graph above shows no obvious upward or downward trend in word count over time.
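
A per-year summary is one way to check for a trend beyond eyeballing the scatter plot. The sketch below runs on a small made-up data frame (`toy`), not on the notebook's `data`; it takes the median word count per fiscal year and fits a straight line to those medians.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `data`; the values are made up
toy = pd.DataFrame({
    "fiscal_year": [2010, 2010, 2011, 2011, 2012, 2012],
    "word_count": [9000, 11000, 10000, 12000, 9500, 10500],
})

# Median word count per fiscal year
median_by_year = toy.groupby("fiscal_year")["word_count"].median()

# Least-squares line through the yearly medians; x is years since the first year
years_since_start = (median_by_year.index - median_by_year.index.min()).to_numpy()
slope, intercept = np.polyfit(years_since_start, median_by_year.to_numpy(), deg=1)

print(median_by_year)
print(f"Fitted change in median word count per year: {slope:.1f}")
```

A slope near zero relative to the typical word count would agree with the observation that there is no clear trend.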

### 4. Generate a word cloud based on all the risk text

In [None]:
#convert list to string and generate
all_risks=(" ").join(data.risk)

In [None]:
wordcloud = WordCloud(width = 1000, height = 500, background_color="white").generate(all_risks)
plt.figure(figsize=(15,8))
plt.imshow(wordcloud)
plt.axis("off")
plt.savefig("wordcloud"+".png", bbox_inches='tight')
plt.show()
plt.close()

### 5. Save the cleaned data frame

In [None]:
data.to_csv("data.csv", encoding='utf-8', sep=',', header=True)

In [None]:
len(data)

## IV: Text Analysis

### 0-0. Create gold labels

* Create a vector **y_binary** of gold labels for stock prices. 0 stands for below-average stock prices and 1 represents above-average stock prices. 

In [None]:
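# compute the mean once instead of once per row; 1 = at or above the mean price, 0 = below
mean_price = data['price'].mean()
y_binary = [0 if price < mean_price else 1 for price in data['price']]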

In [None]:
len(y_binary)

In [None]:
sum(y_binary) / len(y_binary)

### 0-1. Helper Functions

In [None]:
############## code from INFO 6350 problem set code ##############
def plot_compare(X, labels, title, reduce=True, alpha=0.2):
    '''
    Takes an array of object data, a set of cluster labels, and a title string
    Reduces dimensions to 2 and plots the clustering.
    Returns nothing.
    '''
    import matplotlib.pyplot as plt
    import seaborn as sns
    from   sklearn.decomposition import TruncatedSVD

    if reduce:
        # TruncatedSVD is fast and can handle sparse inputs
        # PCA requires dense inputs; MDS is slow
        coordinates = TruncatedSVD(n_components=2).fit_transform(X)
    else:
        # Optionally handle 2-D inputs
        coordinates = X
    
    # Set up figure
    fig, ax = plt.subplots(figsize=(12,6))

    # Unlabeled data
    plt.subplot(121) # 1x2 plot, position 1
    plt.scatter(
        coordinates[:, 0], 
        coordinates[:, 1], 
        alpha=alpha, # Set transparency so that we can see overlapping points
        linewidths=0 # Get rid of marker outlines
    )
    plt.title("Unclustered data")

    # Labeled data
    plt.subplot(122)
    sns.scatterplot(
        x=coordinates[:, 0], 
        y=coordinates[:, 1],
        hue=labels,
        alpha=alpha,
        palette='viridis',
        linewidth=0
    )
    plt.title(title)
    plt.show()

In [None]:
############## code from INFO 6350 problem set code ##############
def pull_samples(texts, labels, n=3):
    '''
    Takes lists of texts and an array of labels, as well as number of samples to return per label.
    Prints sample texts belonging to each label.
    '''
    texts_array = np.array(texts) # Make the input text list easily addressable by NumPy
    for label in np.unique(labels): # Iterate over labels
        print("Label:", label)
        sample_index = np.where(labels == label)[0] # Limit selection to current label
        print("Number of texts in this cluster:", len(sample_index), '\n')
        chosen = np.random.choice(sample_index, size=n) # Sample n texts with this label
        for choice in chosen:
            print("Sample text:", choice)
            # print(str(texts_array[choice]).split(" 0")[0], '\n') # Print each sampled text
            print(str(texts_array[choice])[1:80], '\n') # Print each sampled text
        print("###################################")

### Q1: What are the tones or sentiments used to describe the risks of companies? Are they related to the stock prices? 

### 1-1. Sentiment Analysis

In [None]:
# Making stopwords list
stoplist = stopwords.words('english')
stoplist.extend(string.punctuation) # also treat punctuation marks as stopwords

In [None]:
emolex_file = os.path.join('emolex.txt')

In [None]:
############## code from INFO 6350 problem set code ##############
def read_emolex(filepath=None):
    '''
    Takes a file path to the emolex lexicon file.
    Returns a dictionary of emolex sentiment values.
    '''
    if filepath==None: # Try to find the emolex file
        filepath = os.path.join('..','..','data','lexicons','emolex.txt')
        if os.path.isfile(filepath):
            pass
        elif os.path.isfile('emolex.txt'):
            filepath = 'emolex.txt'
        else:
            raise FileNotFoundError('No EmoLex file found')
    emolex = {} # maps word -> {emotion: value}
    with open(filepath, 'r') as f:
    # emolex file format is: word emotion value
        for line in f:
            word, emotion, value = line.strip().split()
            emolex.setdefault(word, {})[emotion] = int(value)
    # return the dictionary itself so that lexicon[word][emotion] lookups work
    return emolex

In [None]:
# Get EmoLex data. Make sure you set the right file path above.
emolex = read_emolex(emolex_file)

In [None]:
# sentence_sentiment_score from INFO 6350 problem set code
def sentence_sentiment_score(toks, lexicon = emolex):
    total = 0
    emo_dict = defaultdict(lambda: 0)
    
    emotions = ['anger', 'anticipation','disgust','fear','joy','negative','positive','sadness','surprise', 'trust']
    
    
    for word in toks:
            total += 1
            for emotion in emotions:
                try:
                    emo_dict[emotion] += lexicon[word][emotion]
                except:
                    continue
    
    for emotion in emotions:
        if total > 0:
            emo_dict[emotion] /= total
        
    return emo_dict

In [None]:
def getSentScore(sentence_dicts, data, index):
    
    sum_sent_dict =  {'anger': 0 , 'anticipation': 0,'disgust': 0,'fear': 0,'joy': 0,'negative': 0,'positive': 0,'sadness': 0,'surprise': 0, 'trust': 0}
    
    for sentence_dict in sentence_dicts:
        for emotion in sentence_dict:
            sum_sent_dict[emotion] += sentence_dict[emotion]
    
    for emotion in sum_sent_dict.keys():
        sum_sent_dict[emotion] /= len(sentence_dicts)
        data.at[index, emotion] = sum_sent_dict[emotion]

In [None]:
# tokenize_text from INFO 6350 problem set code
def tokenize_text(text, stops=[]):
    sentences = []
    for sent in sent_tokenize(text.lower()):
        sentences.append([word for word in word_tokenize(sent) if word not in stops])
    return sentences

In [None]:
len(data)

In [None]:
#### Adding sentiment score columns
size = len(data)

data['anger'] = np.zeros(size)
data['anticipation'] = np.zeros(size)
data['disgust'] = np.zeros(size)
data['fear'] = np.zeros(size)
data['joy'] = np.zeros(size)
data['negative'] = np.zeros(size)
data['positive'] = np.zeros(size)
data['sadness'] = np.zeros(size)
data['surprise'] = np.zeros(size)
data['trust'] = np.zeros(size)

In [None]:
%%time

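# iterate over the data frame's own index labels: after filtering and dropping duplicates
# they are no longer 0..n-1, so positions from enumerate() would write to the wrong rows
for index, text in data['risk'].items():
    sentence_dicts = []
    for sentence in tokenize_text(text, stops=stoplist):
        sentence_dicts.append(sentence_sentiment_score(sentence))
    getSentScore(sentence_dicts, data, index)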

In [None]:
data.describe()

In [None]:
# Vectorize
vectorizer = TfidfVectorizer(
    encoding = 'utf-8',
    strip_accents = 'unicode',
    lowercase = True,
    min_df = 0.01,
    max_df = 0.9,
    use_idf=True
)

In [None]:
len(data.fiscal_year)

In [None]:
# perform vectorization
X = vectorizer.fit_transform(data.risk.values.astype('U'))
print("Shape of the feature matrix", X.shape)

In [None]:
# standard-scale feature matrix
X = StandardScaler().fit_transform(X.toarray()) # toarray() gives an ndarray; recent scikit-learn rejects np.matrix

### 1-2. Clustering

In [None]:
y_kmeans = KMeans(n_clusters=2).fit_predict(X) # this output is the cluster labels

# Print label vector shape
print('Label vector shape: ', y_kmeans.shape)

print("Using KMeans clustering with n=2 clusters.")

# Plot results
plot_compare(X, y_kmeans, 'k-Means (predicted) labels', reduce=True, alpha=0.8)

In [None]:
pull_samples(data.risk.tolist(), y_kmeans, 5) # pass the risk texts, as the helper expects a list of texts

### Q2: Which features of the **risk sections** in the 10-K reports are most important for predicting or classifying the companies' stock prices?

### 2-1. Build a token-based classifier

In [None]:
%%time
# Select best features
selector = SelectKBest(score_func=mutual_info_regression, k=50)

# Print the shape of your new feature matrix
X_top = selector.fit_transform(X, y_binary)
print("Shape of the matrix with 50 selected features: ", X_top.shape)

In [None]:
# Calculate a 10-fold cross-validated accuracy score using a logistic regression classifier on the selected features.
print("Mean cross-validated accuracy scores:", 
      np.mean(cross_val_score(LogisticRegression(), X_top, y_binary, scoring='accuracy', cv=10)))

In [None]:
feature_names = vectorizer.get_feature_names_out() # get_feature_names() was removed in scikit-learn 1.2
feature_names_ = [feature_names[i] for i in selector.get_support(indices=True)]

In [None]:
feature_names_

### 2-2. Build a word-embedding-based classifier

In [None]:
def get_doc_embedding(doc, nlp):
    '''
    Takes a document string and a loaded spaCy pipeline.
    Returns the mean word vector of the informative tokens
    (tokens that are not stopwords, punctuation, or whitespace and that have a vector).
    '''
    tokens = nlp(doc)
    culled = [token for token in tokens if not (token.is_stop or token.is_punct or token.is_space) and token.has_vector]
    mean_vector_culled = np.mean([token.vector for token in culled], axis=0)

    return mean_vector_culled

In [None]:
nlp = spacy.load("en_core_web_lg") # Note '_lg' = large model

In [None]:
%%time
# one row per document, one column per embedding dimension
X_embedding = np.zeros((len(data.risk), nlp.vocab.vectors_length))

for i, content in enumerate(data.risk):
    X_embedding[i] = get_doc_embedding(content, nlp)  

In [None]:
print("Shape of embedding matrix: ", X_embedding.shape)

In [None]:
# standard-scale feature matrix
X_embedding = StandardScaler().fit_transform(X_embedding)

In [None]:
# Calculate a 10-fold cross-validated accuracy score using a logistic regression classifier on the document embeddings.

print("Mean cross-validated accuracy scores:", 
      np.mean(cross_val_score(LogisticRegression(max_iter=500), X_embedding, y_binary, scoring='accuracy', cv=10)))

### 2-3. Evaluate regression performance

In [None]:
print("Token-based Mean 10-fold cross-validated R^2:", 
      np.mean(cross_val_score(LinearRegression(), X_top, y_binary, scoring='r2', cv=10)))

In [None]:
print("Embedding-based Mean 10-fold cross-validated R^2:", 
      np.mean(cross_val_score(LinearRegression(), X_embedding, y_binary, scoring='r2', cv=10)))

### 2-4. Improve classification performance

#### 2-4-1. Improve token-based classifier

##### Feature Engineering
* Let's reduce the number of selected features from 50 to 20
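
One way to compare several feature counts is to put the selector and the classifier in a single `Pipeline` and cross-validate the whole thing, so that features are scored only on the training part of each fold. The sketch below is an illustration on synthetic data, not the notebook's method: the values of `k`, the use of `mutual_info_classif`, and all variable names are assumptions.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the scaled TF-IDF matrix and the binary labels
X_demo, y_demo = make_classification(
    n_samples=200, n_features=40, n_informative=5, random_state=0
)

# Fix the random state so the mutual information estimates are repeatable
score_func = partial(mutual_info_classif, random_state=0)

accuracy_by_k = {}
for k in (5, 20, 40):
    pipeline = Pipeline([
        ("select", SelectKBest(score_func=score_func, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    # The selector is refit inside every fold, on the training split only
    scores = cross_val_score(pipeline, X_demo, y_demo, scoring="accuracy", cv=5)
    accuracy_by_k[k] = scores.mean()

for k, acc in accuracy_by_k.items():
    print(f"k={k}: mean accuracy={acc:.3f}")
```

Selecting features on the full data before cross-validation lets the held-out folds influence the choice of features, which can make accuracy look better than it is.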

In [None]:
selector_k = SelectKBest(score_func=mutual_info_regression, k=20)

In [None]:
# Print the shape of your new feature matrix
X_token_k = selector_k.fit_transform(X, y_binary)
print("Shape of the matrix with 20 selected features: ", X_token_k.shape)

In [None]:
print("Mean cross-validated accuracy scores:", 
      np.mean(cross_val_score(LogisticRegression(), X_token_k, y_binary, scoring='accuracy', cv=10)))

In [None]:
%%time
print("Random Forest === Mean cross-validated accuracy scores:", 
      np.mean(cross_val_score(RandomForestClassifier(max_features="sqrt"), X_token_k, y_binary, scoring='accuracy', cv=10))) # "sqrt" is what "auto" meant for classifiers; "auto" was removed in scikit-learn 1.3

In [None]:
%%time
print("Decision Tree === Mean cross-validated accuracy scores:", 
      np.mean(cross_val_score(DecisionTreeClassifier(max_depth=100), X_token_k, y_binary, scoring='accuracy', cv=10)))

In [None]:
feature_names_k = [feature_names[i] for i in selector_k.get_support(indices=True)]
feature_names_k

#### 2-4-2. Improve embedding-based classifier

##### Try SVM classifier
* linear SVM 
* non-linear SVM
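
A side-by-side comparison of the two SVM variants listed above could look like the sketch below. It uses synthetic data in place of `X_embedding` and `y_binary`, and the kernel choices (`linear` and `rbf`) and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the document embeddings and the binary labels
X_demo, y_demo = make_classification(n_samples=200, n_features=30, random_state=0)

svm_scores = {}
for kernel in ("linear", "rbf"):
    # Scale inside the pipeline so each fold is scaled from its own training split
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X_demo, y_demo, scoring="accuracy", cv=5)
    svm_scores[kernel] = scores.mean()

for kernel, acc in svm_scores.items():
    print(f"SVM ({kernel} kernel): mean accuracy={acc:.3f}")
```

`SVC()` with no arguments uses the RBF kernel, so it covers only the non-linear case.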

In [None]:
%%time
print("Mean cross-validated accuracy scores:", 
      np.mean(cross_val_score(SVC(), X_embedding, y_binary, scoring='accuracy', cv=10)))

### 2-5. Conclusion on Q2

# Part 4: Results and Discussion

# Part 5: Reflection

# Part 6: References

* https://stackoverflow.com/questions/48687857/python-json-list-to-pandas-dataframe

* https://www.youtube.com/watch?v=fw4gK-leExw&ab_channel=IzzyAnalytics

# Part 7: Responsibility Statement

I completed this project on my own. 