# Naïve Bayes
## Introduction to Naive Bayes
Naive Bayes is a probabilistic machine learning algorithm based on the Bayes' Theorem, used for classification tasks. it assumes that features in a dataset are mutually independent, which is a strong assumption. 

For each class and feature, the method calculates the mean and variance, which will be used to make probability estimates based on the Gaussian distribution. For each instance to be classified, we calculate the posterior probability for every class using the Bayes formula.
Assume it is independence between features. So, for multiple features, the posterior probability is the product of individual probabilities of each feature. Then we compare the posterior probabilities across classes and assign the class with the highest probability to the instance, and the class with highest probability is our prediction

There are three types of Naive Bayes Classifier: Gaussian, Multinomial, Bernoulli. They share common principles but differ in handling the type of data they are applied to.

Gaussian Naive Bayes: Gaussian is best used in cases where features are continuous and can be assumed to have a Gaussian distribution. It assumes that features follow a normal distribution. Instead of using discrete counts, it uses the mean and variance of the features to estimate the probabilities. If the features are continuous, it assumes that these features are sampled from a Gaussian distribution (bell curve).

Multinomial Naive Bayes: Multinomial is primarily used for document classification problems where the features are related to word counts or frequencies within the documents. This model is based on frequency counts. It calculates the likelihood of each outcome based on the frequency count of the features. The probabilities are then estimated for the new instance using these counts. It can handle the frequency of occurrences of outcomes in a dataset and is particularly useful for text classification.

Bernoulli Naive Bayes: Bernoulli is suitable for datasets where features are binary or boolean, such as text classification where the presence or absence of a feature is more informative than frequency counts. It works similarly to the Multinomial Naive Bayes but with binary variables. It uses the Bernoulli distribution and assumes all our features are binary such that they take only two values.

In [23]:
import requests

api_key = '4ce9d5ba9ae046d8b647f0993345ef7f'

url = 'https://newsapi.org/v2/everything'
params = {
    'q': 'Apple', 
    'from': '2023-05-08', 
    'to': '2023-11-01', 
    'sortBy': 'publishedAt', 
    'apiKey': api_key, 
}
response = requests.get(url, params=params)

if response.status_code == 200:
    data = response.json()
    articles = data['articles']
    df_articles = pd.DataFrame(articles)
    print(df_articles.head())
else:
    print(f"Failed to fetch news: {response.status_code}")

columns_of_interest = ['source', 'author', 'title', 'description', 'url', 'publishedAt', 'content']
df_articles_filtered = df_articles[columns_of_interest]
print(df_articles_filtered.head())

df_articles_filtered.to_csv("AAPL_NEWS.csv")

Failed to fetch news: 426
                                              source  \
0                 {'id': None, 'name': 'Biztoc.com'}   
1  {'id': None, 'name': 'Investor's Business Daily'}   
2               {'id': None, 'name': 'Applech2.com'}   
3               {'id': None, 'name': 'Applech2.com'}   
4               {'id': None, 'name': 'Applech2.com'}   

                      author  \
0              thestreet.com   
1  Investor's Business Daily   
2                   applech2   
3                   applech2   
4                   applech2   

                                               title  \
0  One tech startup has found a solution to in-pe...   
1  Dow Jones Futures: What To Do After Big Market...   
2  AmazonでSatechiのMacBook Pro (14/16インチ)用ハードシェルケー...   
3  macOS 14.1 SonomaではWindow Serverのバグにより、Adobe P...   
4  エレコム、macOS 14 Sonomaでジェスチャー機能を割り当てした際、アプリが強制終了...   

                                         description  \
0  As much as 2021 was synonymous with an enormou..

In [33]:
import pandas as pd

AAPL = pd.read_csv('AAPL_Cleaned.csv')
MSFT = pd.read_csv('MSFT_Cleaned.csv')
GOOGL = pd.read_csv('GOOGL_Cleaned.csv')
AMZN = pd.read_csv('AMZN_Cleaned.csv')
META = pd.read_csv('META_Cleaned.csv')
TSLA = pd.read_csv('TSLA_Cleaned.csv')

In [34]:
AAPL['label'] = (AAPL['close'] > AAPL['close'].shift(1)).astype(int)
MSFT['label'] = (MSFT['close'] > MSFT['close'].shift(1)).astype(int)
GOOGL['label'] = (GOOGL['close'] > GOOGL['close'].shift(1)).astype(int)
AMZN['label'] = (AMZN['close'] > AMZN['close'].shift(1)).astype(int)
META['label'] = (META['close'] > META['close'].shift(1)).astype(int)
TSLA['label'] = (TSLA['close'] > TSLA['close'].shift(1)).astype(int)

In [35]:
from sklearn.model_selection import train_test_split

X = AAPL.drop(['timestamp', 'label'], axis=1)
y = AAPL['label']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


Split data into train and test to ensure that the model can be tuned and evaluated properly. The training set creates the model, the validation set tunes the model’s hyperparameters, and the testing set provides a final metric of how well the model is expected to perform on unseen data.

## Feature Selection for record data

In [37]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

""" AAPL = AAPL.drop('timestamp', axis=1)
AAPL = AAPL.dropna() """

X = AAPL.drop('close', axis=1)
y = AAPL['close']

model = LinearRegression()
rfe = RFE(estimator=model, n_features_to_select=3) 
selector = rfe.fit(X, y)

selected_features = feature_names[selector.support_]
print("Selected features:")
print(selected_features)

Selected features:
Index(['open', 'high', 'low'], dtype='object')
