# Introduction to Strength Analysis Using Fama French Model

In this notebook, titled `Strength_Analysis_GA.ipynb`, we delve into training a strength score model based on the Fama French (FF) three-factor model. This model is designed to predict the financial strength of news items related to cryptocurrencies by analyzing their impact on market behavior. The development dataset includes `Strength/FF3_daily.csv` for the Fama French factors and `Strength/news_short.xlsx` for the news headlines, while the deployment dataset utilizes `GA_data_wsentiment.csv` as input and outputs to `GA_Results.csv`. The final output, `GA_Results.csv`, is structured similarly to `CryptoGDelt2022.csv`, albeit with some additional date fields used for organizing news by date and time.

## Objective

The aim is to create a predictive model that assesses the strength of cryptocurrency-related news by examining its correlation with market returns, based on the Fama French three-factor model. By incorporating variables such as market risk, size, and value factors from the FF model, we seek to quantify the financial impact of each news article. The enriched dataset, `GA_Results.csv`, will serve as a comprehensive tool for analyzing the influence of news on cryptocurrency markets, enhancing investment strategies and market analysis.

## Methodology

The process involves several key steps:

1. **Data Preparation**: Import and preprocess the Fama French daily factors and news headlines for model training. This includes cleaning the news data and aligning it with the corresponding market data based on publication dates.

2. **Model Training**: Utilize the Fama French three-factor model to establish a baseline for expected market returns. Train a classifier to predict the strength score of news items, distinguishing between news with significant positive or negative impacts on market behavior.

3. **Model Deployment**: Apply the trained model to the `GA_data_wsentiment.csv` dataset, which contains pre-processed news items with appended sentiment analysis. The model assesses each news item's impact based on its content and sentiment, assigning a strength score that reflects its potential influence on the cryptocurrency market.

4. **Output Generation**: The resulting dataset, `GA_Results.csv`, includes the original news data along with the computed strength scores and associated financial metrics. This dataset facilitates comprehensive analysis, allowing for a deeper understanding of how news sentiment and content correlate with market movements.

5. **Analysis and Visualization**: Perform further analysis on the enriched dataset to identify patterns and insights. Visualize the relationship between news strength scores and market reactions, providing actionable intelligence for market participants.

## Usage

This notebook is crucial for financial analysts, investors, and researchers aiming to gauge the impact of news on cryptocurrency markets. It enables users to filter news based on its financial strength, prioritize analysis on significant news items, and develop informed market strategies.

## Requirements

- Python 3.x
- Pandas and NumPy for data manipulation
- Pandas DataReader and yfinance for financial data retrieval
- Statsmodels for statistical modeling
- NLTK for text preprocessing
- Scikit-learn for machine learning tasks
- Matplotlib and Seaborn for data visualization
- Various custom functions and pipelines for data preprocessing and analysis

## Conclusion

By integrating financial models with natural language processing and machine learning, this notebook presents a novel approach to analyzing the financial strength of cryptocurrency-related news. The `GA_Results.csv` dataset not only serves as a valuable resource for understanding market dynamics but also aids in developing predictive models that can forecast market reactions to news, thereby offering strategic advantages in the volatile cryptocurrency market.


In [135]:
## Imports
import warnings
import pandas as pd
import numpy as np
import pandas_datareader.data as pdr
import datetime as dt
import yfinance as yf
import statsmodels.api as sm
import getFamaFrenchFactors as gff
import seaborn as sns
import nltk
from nltk.stem import *
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import re
from datetime import date, timedelta 
from gdeltdoc import GdeltDoc, Filters
import eventstudy as es
from eventstudy import excelExporter
import string
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
from sklearn.base import BaseEstimator, TransformerMixin
from math import exp
import pickle
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from scipy.stats import uniform
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

warnings.filterwarnings('ignore')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Functions

In [39]:
#The below function is a supporting function useful for calculating the daily returns of a specific ticker, based on a
#specific range of dates given by a news dataframe.

def calculated_returns(news_df,date_field,ticker):
    """
    news_df: it will include the news dataframe, with the dates of each piece of news.
    date_field: the field including the date within the dataframe
    ticker: the crypto ticket
    """
    #Copying the dataframe:
    df = news_df.copy()
    
    #Importing Daily Values of FF3 from CSV:
    ff3_daily=pd.read_csv("FF3_daily.txt",parse_dates=['date'])
    ff3_daily.rename(columns={"date": 'Date'}, inplace=True)
    ff3_daily.set_index('Date', inplace=True)

    #Converting the date for proper usage:
    df["date_parsed"]=pd.to_datetime(df[date_field])
    
    #Extracting the dates:
    min_date = df["date_parsed"].min()
    max_date = df["date_parsed"].max()
    start = dt.datetime.strftime(min_date, "%Y-%m-%d")
    end = dt.datetime.strftime(max_date, "%Y-%m-%d")
    
    #print("The earliest news is from "+start)
    #print("The latest news is from "+end)
    
    #Length of frame:
    days = dt.datetime.strptime(end, "%Y-%m-%d")-dt.datetime.strptime(start, "%Y-%m-%d")
    interval = days.days + 1
    
    #print("The number of days to analyze is {}.".format(interval))
   
    #The first run will be calculated using data from the day before of the desired calculation date:
    min_date = min_date - dt.timedelta(days=1)
    
    expected_return_array = []
    day_array = []
    real_return_array = []
    
    for i in range(0,interval):
        
        #Setting the time to get datapoints:
        real_start = dt.datetime.strftime(min_date-dt.timedelta(700), "%Y-%m-%d")
        start = dt.datetime.strftime(min_date, "%Y-%m-%d")
        
        #Parsing Ticket Data: 700 hunderd days before the min date:
        stock_data = yf.download(ticker, real_start, start)
        #Getting the % ratio of the selected ticker
        stock_returns = stock_data['Adj Close'].resample('d').last().pct_change().dropna()
        stock_returns.name = "Day_Rtn"
        #Creating a single table: merging the 700 days window
        ff_data = ff3_daily.merge(stock_returns,on='Date')
        
        #Creating a linear Model to caclulate the Beta Coefficients:
        X = ff_data[['Mkt-RF', 'SMB', 'HML']]
        y = ff_data['Day_Rtn'] - ff_data['RF']
        X = sm.add_constant(X)
        ff_model = sm.OLS(y, X).fit()
        intercept, b1, b2, b3 = ff_model.params
        
        rf = ff_data['RF'].mean()
        market_premium = ff3_daily['Mkt-RF'].mean()
        size_premium = ff3_daily['SMB'].mean()
        value_premium = ff3_daily['HML'].mean()
        
        #Expected daily return:
        expected_daily_return = rf + b1 * market_premium + b2 * size_premium + b3 * value_premium 
        #Getting the real return of the last date:
        real_return = ff_data.iloc[ff_data.shape[0]-1]['Day_Rtn']
        #Updating the date:
        min_date = min_date + dt.timedelta(days=1)
    
        expected_return_array.append(expected_daily_return)
        day_array.append(min_date)
        real_return_array.append(real_return)
        
    return (expected_return_array, real_return_array, day_array)

In [40]:
#Function to Model the Target Variable of a dataframe news using Famma French.

def abnoraml_return_calculation(news_df,date_field,ticker):
    """
    news_df: it will include the news dataframe, with the dates of each piece of news.
    date_field: the field including the date within the dataframe
    ticker: the crypto ticket
    """
    #Copying the dataframe:
    df = news_df.copy()
    
    #Converting the date for proper usage:
    df["date_parsed"]=pd.to_datetime(df[date_field])
    df["date_format"] = df["date_parsed"].dt.date
    
    #Extracting the dates:
    min_date = df["date_parsed"].min()
    max_date = df["date_parsed"].max()
    start = dt.datetime.strftime(min_date, "%Y-%m-%d")
    end = dt.datetime.strftime(max_date, "%Y-%m-%d")
    
    print("The earliest news is from "+start)
    print("The latest news is from "+end)
    
    #Length of frame:
    days = dt.datetime.strptime(end, "%Y-%m-%d")-dt.datetime.strptime(start, "%Y-%m-%d")
    interval = days.days + 1
    
    print("The number of days to analyze is {}.".format(interval))


    #######################################
    #Generating the files for event study:
    #######################################
    
    #Reading Famma French Data:
    ff3_daily=pd.read_csv("FF3_daily.txt")
    #Getting Stock data to append calculate the real return:
    real_start = dt.datetime.strftime(min_date-dt.timedelta(100), "%Y-%m-%d")
    real_end = dt.datetime.strftime(min_date+dt.timedelta(100), "%Y-%m-%d")
    stock_data = yf.download(ticker, real_start, real_end)

    #Storing the data for later use in the event:
    ff3_daily.to_csv("{}_famafrench.csv".format(ticker), index=False, date_format='%Y%m%d')
    
    stock_data[ticker] = stock_data["Adj Close"].pct_change()
    stock_data = stock_data.dropna()
    stock_data.reset_index(level=0, inplace=True)
    stock_data = stock_data[['Date',ticker]].copy()
    stock_data.columns = ['date',ticker]
    stock_data.to_csv("{}_returns.csv".format(ticker), index=False, date_format='%Y-%m-%d')
    
    ##############################################################################################
    #Now we are going to calculate the whole process with the daily and expected returns window:
    ##############################################################################################

    #Event Study files definition:
    es.Single.import_FamaFrench("{}_famafrench.csv".format(ticker))
    es.Single.import_returns("{}_returns.csv".format(ticker))
    
    #Loop definition of dates
    dates_for_eventstudy = df["date_format"].drop_duplicates().sort_values()

    i = 1
    listAR=[]
    dates=[]
    for date in dates_for_eventstudy:
        date = dt.datetime.strftime(date, "%Y-%m-%d")
        #print(date)
        try:
            event = es.Single.FamaFrench_3factor(
                security_ticker = ticker,
                event_date = np.datetime64(date),
                event_window = (-2,+4), 
                estimation_size = 50,
                buffer_size = 30)

            listAR.append(event.AR)
            dates.append(date)
        
        except:
            list_nans = []
            for nans in range(event.event_window_size):
                list_nans.append(np.nan)
            listAR.append(list_nans)
            dates.append(date)
    
    columns_ar=[]
    
    for i in range(event.event_window[0],event.event_window[1]+1):
        columns_ar.append("AR"+str(i))

    df_AR = pd.DataFrame(listAR, columns=columns_ar)
    df_test2 = pd.DataFrame({'date':dates}).join(df_AR)       
    df_test3 = df_test2.dropna()
    df_test3["date"] = pd.to_datetime(df_test3["date"])
    
    #Calculating the expected return for each day of the dates in the training news as well as the real one:
    expected_day, real_day, day_array = calculated_returns(df,date_field,ticker)
    returns_df = pd.DataFrame()
    returns_df['date']=day_array
    returns_df['format_date']=returns_df['date'].dt.date
    returns_df['expected_return']=expected_day
    returns_df['real_return']=real_day
    returns_df['format_date'] = pd.to_datetime(returns_df['format_date'])
    
    df_test4 = pd.merge(df_test3,returns_df, left_on ="date", right_on="format_date").drop(columns=["date_y","format_date"],axis=1)
    
    
    #We are changing this part, as we are using the expcted daily return for each specific date!
    #Instead of using the expected_daily_return as a constant value, we will use the expected return date we have
    #estimated for each specific day.
    #The threshold as agreed with Prof. Manoel G. is @ 6 days, but the ratio of abnormal returns is very high in
    #relation to the 180 days of 6 months. Adjusted value to 30 days.
    
    relevant_matrix_pos = pd.DataFrame()
    relevant_matrix_neg = pd.DataFrame()
    i = 0
    for column_ in df_test4.columns[1:len(df_test4.columns)-2]:
        relevant_matrix_pos[str(i)] = df_test4[column_]>df_test4["expected_return"]*180
        relevant_matrix_neg[str(i)] = df_test4[column_]<-df_test4["expected_return"]*180
        i += 1
        
   
    df_test4["Relevant_pos"] = relevant_matrix_pos[list(relevant_matrix_pos.columns)].any(axis=1).astype(int)
    df_test4["Relevant_neg"] = 1*(relevant_matrix_neg[list(relevant_matrix_neg.columns)].any(axis=1).astype(int))
    
    df_test4['target'] = df_test4["Relevant_pos"] | df_test4["Relevant_neg"]
    
    df_test4['date'] = df_test4['date_x'].dt.date
    df['date'] = df['date_parsed'].dt.date
    
    df_result = pd.merge(df,df_test4, left_on ="date", right_on="date").drop(columns=["date_x"],axis=1)
    
    return df_result

In [101]:
# Custom transformer for text cleaning
class TextCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        cleaned_data = X.astype(str).map(lambda x: x.lower())
        cleaned_data = cleaned_data.map(lambda x: re.sub('[^A-Za-z0-9]+', ' ', x))
        return cleaned_data

# Custom transformer for stop words removal
class StopWordsRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        removed_stop_words = X.apply(lambda x: ' '.join([word for word in x.split() if word not in self.stop_words]))
        return removed_stop_words

# Custom transformer for lemmatization
class Lemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        lemmatized_data = X.apply(lambda x: ' '.join([self.lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
        return lemmatized_data

### Collecting news (before sentiment analysis) for model training.

In [84]:
news_df = pd.read_csv('news_short.csv')

In [85]:
input_df = news_df[['title','seendate']]
news_df2 = pd.DataFrame()
news_df2["date_parsed"]=pd.to_datetime(news_df["seendate"])
news_df2["date_format"] =news_df2["date_parsed"].dt.date

In [13]:
input_df.head()

Unnamed: 0,title,seendate
0,LongHash Ventures and Terraform Labs Join Forc...,20220406T163000Z
1,TERRA . DO TO COMPETE IN FINAL 20 GROUP FOR ED...,20220406T001500Z
2,Terra founder plans to back its stablecoin wit...,20220406T213000Z
3,Crypto platform Leap raises $3 . 2 mn in fundi...,20220406T081500Z
4,Can THORchain Keep Surging ? | The Motley Fool,20220406T120000Z


In [14]:
news_df2.head()

Unnamed: 0,date_parsed,date_format
0,2022-04-06 16:30:00+00:00,2022-04-06
1,2022-04-06 00:15:00+00:00,2022-04-06
2,2022-04-06 21:30:00+00:00,2022-04-06
3,2022-04-06 08:15:00+00:00,2022-04-06
4,2022-04-06 12:00:00+00:00,2022-04-06


---
### Fama French Three Factor
Creating the Bitcoin DataFrame

In [27]:
result_btc = abnoraml_return_calculation(input_df,'seendate','BTC-USD')

The earliest news is from 2022-03-12
The latest news is from 2022-04-07
The number of days to analyze is 27.
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 comp

In [51]:
result_btc.head(2)

Unnamed: 0,title,seendate,date_parsed,date_format,date,AR-2,AR-1,AR0,AR1,AR2,AR3,AR4,expected_return,real_return,Relevant_pos,Relevant_neg,target
0,LongHash Ventures and Terraform Labs Join Forc...,20220406T163000Z,2022-04-06 16:30:00+00:00,2022-04-06,2022-04-06,0.032101,-0.003755,-0.04461,0.018444,-0.045935,0.008446,-0.022411,0.000304,0.00364,0,0,0
1,TERRA . DO TO COMPETE IN FINAL 20 GROUP FOR ED...,20220406T001500Z,2022-04-06 00:15:00+00:00,2022-04-06,2022-04-06,0.032101,-0.003755,-0.04461,0.018444,-0.045935,0.008446,-0.022411,0.000304,0.00364,0,0,0


---
### Model training - NLP Basic Analysis

In [31]:
training_df = result_btc[['title','target']]

In [47]:
text_processing_pipeline = Pipeline([
    ('text_cleaning', TextCleaner()),
    ('stop_words_removal', StopWordsRemover()),
    ('lemmatization', Lemmatizer())
])

In [48]:
processed_data = text_processing_pipeline.fit_transform(training_df['title'])

In [49]:
training_df['lemmetized_titles'] = processed_data

In [50]:
#Chceking whether lemmatization has been applied:
training_df.head(5)

Unnamed: 0,title,target,lemmetized_titles
0,longhash ventures and terraform labs join forc...,0,longhash venture terraform lab join force adva...
1,terra do to compete in final 20 group for edte...,0,terra compete final 20 group edtech competitio...
2,terra founder plans to back its stablecoin wit...,0,terra founder plan back stablecoin basket cryp...
3,crypto platform leap raises 3 2 mn in funding ...,0,crypto platform leap raise 3 2 mn funding coin...
4,can thorchain keep surging the motley fool,0,thorchain keep surging motley fool


In [53]:
X_train, X_test, y_train, y_test = train_test_split(training_df['lemmetized_titles'],training_df['target'], stratify=training_df['target'],test_size=0.3)

In [54]:
# Building a Naive Bayes Classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [55]:
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
score_pred = model.predict_proba(X_test)[:,1]

In [56]:
accuracy_score(y_test, y_pred)

0.6435268958710314

In [57]:
confusion_matrix(y_test, y_pred)

array([[1573, 1308],
       [ 859, 2339]])

In [60]:
#pickle.dump(model, open('BTCStrengthScoreModel.sav', 'wb'))

### Creating a model taking into account sentiment

In [136]:
news_ws_df = pd.read_csv('news_short_wsentiment.csv')

In [137]:
training_df2 = pd.DataFrame({
    'title': news_ws_df['title'],
    'sentiment_class': news_ws_df['sentiment_class'],
    'target': result_btc['target']
})

In [138]:
training_df2.head()

Unnamed: 0,title,sentiment_class,target
0,LongHash Ventures and Terraform Labs Join Forc...,1,0
1,TERRA . DO TO COMPETE IN FINAL 20 GROUP FOR ED...,0,0
2,Terra founder plans to back its stablecoin wit...,1,0
3,Crypto platform Leap raises $3 . 2 mn in fundi...,1,0
4,Can THORchain Keep Surging ? | The Motley Fool,0,0


In [139]:
text_pipeline = Pipeline([
    ('text_cleaning', TextCleaner()),
    ('stop_words_removal', StopWordsRemover()),
    ('lemmatization', Lemmatizer()),
    ('tfidf', TfidfVectorizer()),
])

# Define the full preprocessing pipeline with ColumnTransformer
# Note: Now we directly apply OneHotEncoder to the 'sentiment_class' column without using a custom selector
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_pipeline, 'title'),  # Apply text_pipeline to 'title' column
        ('sentiment', OneHotEncoder(categories=[[-1, 0, 1]], drop='first'), ['sentiment_class'])  # Directly apply OneHotEncoder to 'sentiment_class'
    ],
    remainder='drop'  # Drop other columns not specified in transformers
)

# Full pipeline with classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

In [140]:
# Prepare data
X = training_df2[['title', 'sentiment_class']]  # 'text' and 'sentiment_score' are the columns to be used
y = training_df2['target']  # Target variable

In [141]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [148]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [149]:
accuracy_score(y_test, y_pred)

0.650764928442178

In [150]:
confusion_matrix(y_test, y_pred)

array([[1721, 1160],
       [ 963, 2235]])

In [142]:
param_distributions = {
    'classifier__n_estimators': randint(100, 1000),  # Number of trees in the forest
    'classifier__max_features': ['auto', 'sqrt', 'log2'],  # Number of features to consider at every split
    'classifier__max_depth': randint(10, 100),  # Maximum number of levels in tree
    'classifier__min_samples_split': randint(2, 20),  # Minimum number of samples required to split a node
    'classifier__min_samples_leaf': randint(1, 20),  # Minimum number of samples required at each leaf node
    'classifier__bootstrap': [True, False]  # Method of selecting samples for training each tree
}

In [143]:
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,  # Number of parameter settings sampled
    cv=5,       # 5-fold cross-validation
    verbose=1,  # Show more logs
    random_state=42,  # For reproducibility
    scoring='accuracy'
)

In [144]:
random_search.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


In [145]:
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)

In [146]:
accuracy_score(y_test, y_pred)

0.5788781049514723

In [147]:
confusion_matrix(y_test, y_pred)

array([[ 747, 2134],
       [ 426, 2772]])

---
#### Applying Model to the Study Case

In [152]:
study_df = pd.read_csv('GA_data_wsentiment.csv')

In [153]:
study_df.head(5)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,relevance_probability,relevance_class,lemmetized_titles,sentiment_negative_probability,sentiment_neutral_probability,sentiment_positive_probability,sentiment_class
0,0,0,https://news.yahoo.com/ai-scams-missouri-warns...,,AI Scams : Missouri warns voices of loved ones...,20240219T204500Z,https://media.zenfs.com/en/ktvi_articles_498/2...,news.yahoo.com,English,United States,0.210185,0.0,ai scam missouri warns voice loved one used fraud,0.999597,0.000242,0.00016,-1
1,1,1,https://www.americanbanker.com/opinion/regulat...,,Regulators should reexamine their assumptions ...,20240219T194500Z,https://source-media-brightspot.s3.us-east-1.a...,americanbanker.com,English,United States,0.565864,1.0,regulator reexamine assumption brokered deposit,0.989373,0.010579,4.8e-05,-1
2,2,2,https://biztoc.com/x/97e1450bfef84362,,South Korean Political Party Eyes Crypto Revol...,20240219T130000Z,https://c.biztoc.com/p/97e1450bfef84362/s.webp,biztoc.com,English,,0.697517,1.0,south korean political party eye crypto revolu...,0.998693,0.000833,0.000474,-1
3,3,3,https://biztoc.com/x/5c2110519540e5cf,,Unraveling the Mystery Behind XRP Price Underp...,20240219T103000Z,https://c.biztoc.com/p/5c2110519540e5cf/s.webp,biztoc.com,English,,0.369046,0.0,unraveling mystery behind xrp price underperfo...,0.001967,0.996818,0.001215,0
4,4,4,https://biztoc.com/x/2f038851769a9841,,Cryptocurrency Rankings : Solana Claims the Co...,20240219T181500Z,https://c.biztoc.com/p/2f038851769a9841/s.webp,biztoc.com,English,,0.512854,1.0,cryptocurrency ranking solana claim coveted fo...,0.000429,0.987183,0.012388,0


In [154]:
training_df = pd.DataFrame({
    'title': study_df['title'],
    'sentiment_class': study_df['sentiment_class']
})

In [155]:
y_pred = pipeline.predict(training_df)
score_pred = model.predict_proba(training_df)[:,1]

In [161]:
study_df['strength_probability']=score_pred[1]

In [162]:
study_df['strength_score']=y_pred

In [163]:
#Getting the Real Percent per Day
#Parsing Ticket Data: 700 hunderd days before the min date:
stock_data = yf.download('BTC-USD', '2024-01-29', '2024-02-19')
#Getting the % ratio of the selected ticker
stock_returns = stock_data['Adj Close'].resample('d').last().pct_change().dropna()
stock_returns.name = "Day_Rtn"

[*********************100%%**********************]  1 of 1 completed


In [164]:
len(stock_returns.to_list())

20

In [165]:
returns_df = pd.DataFrame()

In [167]:
returns_df['date']=stock_returns.index
returns_df['real_percent_change']=stock_returns.to_list()

In [168]:
returns_df['date'] = pd.to_datetime(returns_df['date'])

In [169]:
returns_df.head(5)

Unnamed: 0,date,real_percent_change
0,2024-01-30,-0.007754
1,2024-01-31,-0.008614
2,2024-02-01,0.011581
3,2024-02-02,0.002556
4,2024-02-03,-0.004483


In [172]:
study_df['date_format'] = pd.to_datetime(study_df['seendate']).dt.date
study_df['date_format'] = pd.to_datetime(study_df['date_format'])

In [173]:
study_df.dtypes

Unnamed: 0.1                               int64
Unnamed: 0                                 int64
url                                       object
url_mobile                                object
title                                     object
seendate                                  object
socialimage                               object
domain                                    object
language                                  object
sourcecountry                             object
relevance_probability                    float64
relevance_class                          float64
lemmetized_titles                         object
sentiment_negative_probability           float64
sentiment_neutral_probability            float64
sentiment_positive_probability           float64
sentiment_class                            int64
strength_probability                     float64
strength_score                             int64
date_format                       datetime64[ns]
dtype: object

In [174]:
study_df.head(5)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,relevance_probability,relevance_class,lemmetized_titles,sentiment_negative_probability,sentiment_neutral_probability,sentiment_positive_probability,sentiment_class,strength_probability,strength_score,date_format
0,0,0,https://news.yahoo.com/ai-scams-missouri-warns...,,AI Scams : Missouri warns voices of loved ones...,20240219T204500Z,https://media.zenfs.com/en/ktvi_articles_498/2...,news.yahoo.com,English,United States,0.210185,0.0,ai scam missouri warns voice loved one used fraud,0.999597,0.000242,0.00016,-1,0.526015,1,2024-02-19
1,1,1,https://www.americanbanker.com/opinion/regulat...,,Regulators should reexamine their assumptions ...,20240219T194500Z,https://source-media-brightspot.s3.us-east-1.a...,americanbanker.com,English,United States,0.565864,1.0,regulator reexamine assumption brokered deposit,0.989373,0.010579,4.8e-05,-1,0.526015,1,2024-02-19
2,2,2,https://biztoc.com/x/97e1450bfef84362,,South Korean Political Party Eyes Crypto Revol...,20240219T130000Z,https://c.biztoc.com/p/97e1450bfef84362/s.webp,biztoc.com,English,,0.697517,1.0,south korean political party eye crypto revolu...,0.998693,0.000833,0.000474,-1,0.526015,0,2024-02-19
3,3,3,https://biztoc.com/x/5c2110519540e5cf,,Unraveling the Mystery Behind XRP Price Underp...,20240219T103000Z,https://c.biztoc.com/p/5c2110519540e5cf/s.webp,biztoc.com,English,,0.369046,0.0,unraveling mystery behind xrp price underperfo...,0.001967,0.996818,0.001215,0,0.526015,1,2024-02-19
4,4,4,https://biztoc.com/x/2f038851769a9841,,Cryptocurrency Rankings : Solana Claims the Co...,20240219T181500Z,https://c.biztoc.com/p/2f038851769a9841/s.webp,biztoc.com,English,,0.512854,1.0,cryptocurrency ranking solana claim coveted fo...,0.000429,0.987183,0.012388,0,0.526015,1,2024-02-19


In [175]:
df_result = pd.merge(study_df,returns_df, left_on ="date_format", right_on="date")

In [176]:
df_result

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,...,lemmetized_titles,sentiment_negative_probability,sentiment_neutral_probability,sentiment_positive_probability,sentiment_class,strength_probability,strength_score,date_format,date,real_percent_change
0,743,743,https://www.dailypioneer.com/2024/state-editio...,,Man defrauded of Rs 38 lakh by woman he met on...,20240218T054500Z,,dailypioneer.com,English,India,...,man defrauded r 38 lakh woman met matrimonial ...,0.000505,0.998322,0.001172,0,0.526015,1,2024-02-18,2024-02-18,0.008895
1,744,744,https://www.nyoooz.com/news/delhi/1719641/man-...,https://www.nyoooz.com/amp/news/delhi/1719641/...,Man defrauded of Rs 38 lakh by woman he met on...,20240218T194500Z,https://www.nyoooz.com/df-images/delhi/df-delh...,nyoooz.com,English,India,...,man defrauded r 38 lakh woman met matrimonial ...,0.000505,0.998322,0.001172,0,0.526015,1,2024-02-18,2024-02-18,0.008895
2,745,745,https://www.fool.com/investing/2024/02/18/here...,,Here Why I Might Change My Mind and Buy Nvidia...,20240218T124500Z,https://g.foolcdn.com/editorial/images/764678/...,fool.com,English,United States,...,might change mind buy nvidia stock,0.000321,0.997152,0.002527,0,0.526015,1,2024-02-18,2024-02-18,0.008895
3,746,746,https://www.thenorthernecho.co.uk/news/2412397...,,Cobra AI system launched by North East financi...,20240218T074500Z,https://www.thenorthernecho.co.uk/resources/im...,thenorthernecho.co.uk,English,United Kingdom,...,cobra ai system launched north east financial ...,0.000168,0.003786,0.996046,1,0.526015,1,2024-02-18,2024-02-18,0.008895
4,747,747,https://www.livemint.com/news/india/crypto-fra...,https://www.livemint.com/news/india/crypto-fra...,Crypto fraud alert : Gurugram - based exec mee...,20240218T023000Z,https://www.livemint.com/lm-img/img/2024/02/18...,livemint.com,English,India,...,crypto fraud alert gurugram based exec meet wo...,0.999199,0.000692,0.000109,-1,0.526015,1,2024-02-18,2024-02-18,0.008895
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12693,13827,13827,https://readwrite.com/zoom-unveils-ar-vr-featu...,,Zoom unveils AR / VR features for Apple Vision...,20240130T001500Z,https://readwrite.com/wp-content/uploads/2024/...,readwrite.com,English,United States,...,zoom unveils ar vr feature apple vision pro,0.000372,0.998477,0.001151,0,0.526015,0,2024-01-30,2024-01-30,-0.007754
12694,13864,13864,https://biztoc.com/x/1bb8694f4be8d871,,Human - Friendly Addresses ? Lightspark Pushes...,20240130T001500Z,https://c.biztoc.com/p/1bb8694f4be8d871/s.webp,biztoc.com,English,,...,human friendly address lightspark push uma for...,0.000388,0.998428,0.001184,0,0.526015,0,2024-01-30,2024-01-30,-0.007754
12695,13865,13865,https://biztoc.com/x/b3fc33893859e836,,"Crypto ETPs See $500 , 000 , 000 in Institutio...",20240130T001500Z,https://c.biztoc.com/p/b3fc33893859e836/s.webp,biztoc.com,English,,...,crypto etps see 500 000 000 institutional outf...,0.000322,0.103639,0.896039,1,0.526015,0,2024-01-30,2024-01-30,-0.007754
12696,13904,13904,https://biztoc.com/x/d6414b351eeb3fa1,,Republican French Hill says he optimistic abou...,20240130T001500Z,https://c.biztoc.com/p/d6414b351eeb3fa1/s.webp,biztoc.com,English,,...,republican french hill say optimistic prospect...,0.009983,0.764699,0.225319,0,0.526015,1,2024-01-30,2024-01-30,-0.007754


In [179]:
columns_to_drop = ['Unnamed: 0.1', 'Unnamed: 0', 'url', 'url_mobile', 'socialimage', 'domain', 'date_format', 'date']

df_result = df_result.drop(columns=columns_to_drop)

In [182]:
df_result['seendate'] = pd.to_datetime(df_result['seendate'], format='%Y%m%dT%H%M%SZ')
df_result.set_index('seendate', inplace=True)

In [185]:
column_order = ['title', 'lemmetized_titles', 'sourcecountry', 'language', 'relevance_class', 
                'sentiment_class', 'strength_score', 'real_percent_change', 
                'relevance_probability', 'sentiment_negative_probability', 
                'sentiment_neutral_probability', 'sentiment_positive_probability', 
                'strength_probability']

# Reorder the columns
df_result = df_result.reindex(columns=column_order)

In [186]:
df_result.head()

Unnamed: 0_level_0,title,lemmetized_titles,sourcecountry,language,relevance_class,sentiment_class,strength_score,real_percent_change,relevance_probability,sentiment_negative_probability,sentiment_neutral_probability,sentiment_positive_probability,strength_probability
seendate,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2024-02-18 05:45:00,Man defrauded of Rs 38 lakh by woman he met on...,man defrauded r 38 lakh woman met matrimonial ...,India,English,0.0,0,1,0.008895,0.115892,0.000505,0.998322,0.001172,0.526015
2024-02-18 19:45:00,Man defrauded of Rs 38 lakh by woman he met on...,man defrauded r 38 lakh woman met matrimonial ...,India,English,0.0,0,1,0.008895,0.115892,0.000505,0.998322,0.001172,0.526015
2024-02-18 12:45:00,Here Why I Might Change My Mind and Buy Nvidia...,might change mind buy nvidia stock,United States,English,0.0,0,1,0.008895,0.083854,0.000321,0.997152,0.002527,0.526015
2024-02-18 07:45:00,Cobra AI system launched by North East financi...,cobra ai system launched north east financial ...,United Kingdom,English,0.0,1,1,0.008895,0.219234,0.000168,0.003786,0.996046,0.526015
2024-02-18 02:30:00,Crypto fraud alert : Gurugram - based exec mee...,crypto fraud alert gurugram based exec meet wo...,India,English,0.0,-1,1,0.008895,0.416067,0.999199,0.000692,0.000109,0.526015


In [189]:
df_result.to_csv('GA_Results.csv', index = True)