
### Bag of Words (BoW)

First, we build a Bag of Words representation with scikit-learn.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Sample movie reviews and their corresponding sentiments
reviews = [
    ("I loved the movie, it was amazing!", "positive"),
    ("The movie was terrible, I didn't like it.", "negative"),
    ("The plot was intriguing and well-developed.", "positive"),
    ("The acting was subpar and disappointing.", "negative"),
]

# Separate text and labels
texts, labels = zip(*reviews)

# Create a Bag of Words representation using CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Print the vocabulary (unique words)
print("Vocabulary (Unique Words):")
print(vectorizer.get_feature_names_out())

# Print the Bag of Words representation
print("\nBag of Words Representation:")
print(X.toarray())

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=42)

# Build a Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("\nAccuracy:", accuracy)


Vocabulary (Unique Words):
['acting' 'amazing' 'and' 'developed' 'didn' 'disappointing' 'intriguing'
 'it' 'like' 'loved' 'movie' 'plot' 'subpar' 'terrible' 'the' 'was' 'well']

Bag of Words Representation:
[[0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0]
 [0 0 0 0 1 0 0 1 1 0 1 0 0 1 1 1 0]
 [0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 1]
 [1 0 1 0 0 1 0 0 0 0 0 0 1 0 1 1 0]]

Accuracy: 0.0


In [None]:
# X.toarray()
X.get_shape()

(4, 17)

In Bag of Words, each text is tokenized and a dictionary (vocabulary) is built from the data the vectorizer is fitted on; in the example above that is the whole dataset, train and test together. The dictionary is sorted alphabetically, and a feature vector is built for each record. The result is a matrix in which each row has as many columns as there are words in the dictionary, and each entry holds the number of times that word occurs in the record (0 if it is absent). In the output above no word occurs more than once in a review, so every entry happens to be 0 or 1. So if there are 10 records and 25 distinct words after tokenization, the matrix has shape 10 × 25, i.e. 10 rows of 25 columns each.
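To make the matrix layout concrete, here is a minimal sketch on two toy sentences. The sentences and the variable names (`docs`, `vocabulary`, `counts`, `binary`) are illustrative assumptions, not part of the original notebook. It contrasts a count matrix with a presence-only (binary) matrix when a word repeats within a document.

```python
from collections import Counter

# Two toy documents; "the" occurs twice in the first one
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Count matrix: one row per document, one column per vocabulary word
counts = []
for doc in docs:
    word_count = Counter(doc.split())
    counts.append([word_count[word] for word in vocabulary])

# Binary variant: 1 if the word is present, 0 otherwise
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

print(vocabulary)  # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(counts)      # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
print(binary)      # [[1, 0, 1, 1, 1, 1], [0, 1, 0, 0, 1, 1]]
```

The two matrices differ only in the column for "the", which is the one word that repeats within a document.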

Now we implement Bag of Words from scratch, without the library.

In [2]:
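import numpy as np  # needed by vectorize() below

def tokenize(text):
    # Lowercase and split on whitespace; punctuation stays attached to the word
    return text.lower().split()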

def count_words(text):
    word_count = {}
    tokens = tokenize(text)
    for word in tokens:
        word_count[word] = word_count.get(word, 0) + 1
    return word_count

def create_bow(corpus):
    word_index = {}
    bow = []
    for doc in corpus:
        word_count = count_words(doc)
        for word in word_count.keys():
            if word not in word_index:
                word_index[word] = len(word_index)
        bow.append(word_count)
    return bow, word_index

def vectorize(bow, word_index):
    num_documents = len(bow)
    num_words = len(word_index)
    vectorized_bow = np.zeros((num_documents, num_words), dtype=int)
    for i, word_count in enumerate(bow):
        for word, count in word_count.items():
            word_idx = word_index[word]
            vectorized_bow[i, word_idx] = count
    return vectorized_bow

# Sample text documents
documents = [
    "Jane has 2 apples. She bought 3 more. How many apples does she have in total?",
    "John also has apples. He picked 5 apples from the tree.",
    "Jane and John both love apples."
]

# Create Bag of Words
bow, word_index = create_bow(documents)

# Vectorize the Bag of Words
vectorized_bow = vectorize(bow, word_index)

# Print the Bag of Words and word_index
print("Bag of Words:")
print(vectorized_bow)
print("\nWord Index:")
print(word_index)


Bag of Words:
[[1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0]
 [1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1]]

Word Index:
{'jane': 0, 'has': 1, '2': 2, 'apples.': 3, 'she': 4, 'bought': 5, '3': 6, 'more.': 7, 'how': 8, 'many': 9, 'apples': 10, 'does': 11, 'have': 12, 'in': 13, 'total?': 14, 'john': 15, 'also': 16, 'he': 17, 'picked': 18, '5': 19, 'from': 20, 'the': 21, 'tree.': 22, 'and': 23, 'both': 24, 'love': 25}
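Note that the word index above treats `'apples.'` and `'apples'` as different words, because splitting on whitespace leaves punctuation attached, and that words are indexed in order of first appearance rather than alphabetically. The following is a minimal sketch of one way to address both, assuming a simple regex tokenizer that keeps only letters and digits. The names `tokenize_words`, `sorted_index`, and `apples_counts` are hypothetical and chosen so they do not overwrite the functions defined above.

```python
import re

def tokenize_words(text):
    # Lowercase, then keep runs of letters and digits; punctuation is dropped
    return re.findall(r"[a-z0-9]+", text.lower())

documents = [
    "Jane has 2 apples. She bought 3 more. How many apples does she have in total?",
    "John also has apples. He picked 5 apples from the tree.",
    "Jane and John both love apples.",
]

# Alphabetically sorted vocabulary and its word -> column index mapping
vocabulary = sorted({tok for doc in documents for tok in tokenize_words(doc)})
sorted_index = {word: i for i, word in enumerate(vocabulary)}

# Count of "apples" per document now that "apples." and "apples" are merged
apples_counts = [tokenize_words(doc).count("apples") for doc in documents]

print(len(sorted_index))  # 25 (one fewer than the 26 entries above)
print(vocabulary[:3])     # ['2', '3', '5']
print(apples_counts)      # [2, 2, 1]
```

Digits sort before letters, so the numeric tokens come first in the vocabulary.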


### TF-IDF

TF-IDF stands for Term Frequency-Inverse Document Frequency. It is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (the corpus).

The TF-IDF weight for a word in a specific document is calculated using the following formula:

TF-IDF = (Term Frequency in Document) * (Inverse Document Frequency)

1. Term Frequency (TF): It represents how frequently a word appears in a document. It is calculated by dividing the number of times a word occurs in a document by the total number of words in that document.

   TF = (Number of occurrences of word in document) / (Total number of words in document)

2. Inverse Document Frequency (IDF): It measures how rare a word is across the entire corpus. It is calculated by dividing the total number of documents in the corpus by the number of documents that contain the word, and then taking the logarithm of the result.

   IDF = log((Total number of documents) / (Number of documents containing the word))

The TF-IDF weight of a word is the product of TF and IDF, which gives a higher weight to words that appear frequently in a specific document but are rare in the rest of the corpus. This way, common words that appear in many documents (like "the" and "and") receive lower weights, while important and distinctive words in a specific document receive higher weights.
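As a worked example of the arithmetic, the sketch below computes the weight of "movie" in a seven-word document. The corpus size of 4 and the document frequency of 2 are assumed values for illustration, and the natural logarithm is an assumption, since the formula above does not fix a base.

```python
import math

# Tokens of "I loved the movie, it was amazing!" after lowercasing and dropping punctuation
tokens = ["i", "loved", "the", "movie", "it", "was", "amazing"]

total_docs = 4       # assumed corpus size
docs_with_word = 2   # assumed: "movie" occurs in 2 of the 4 documents

tf = tokens.count("movie") / len(tokens)      # 1 / 7
idf = math.log(total_docs / docs_with_word)   # ln(4 / 2) = ln(2)
tfidf = tf * idf

# A word that occurs in every document gets IDF = ln(4 / 4) = 0, so its weight is 0
common_idf = math.log(total_docs / total_docs)

print(f"TF = {tf:.4f}, IDF = {idf:.4f}, TF-IDF = {tfidf:.4f}")
# TF = 0.1429, IDF = 0.6931, TF-IDF = 0.0990
print(common_idf)  # 0.0
```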

The TF-IDF representation helps in capturing the uniqueness of each document in a corpus and is widely used in text mining, information retrieval, and natural language processing tasks.

To calculate TF-IDF without using scikit-learn, you can follow these steps and implement the calculation from scratch:
* Calculate Term Frequency (TF) for each word in each document.
* Calculate Document Frequency (DF) for each word in the entire corpus.
* Calculate Inverse Document Frequency (IDF) for each word based on DF.
* Compute the TF-IDF weight for each word in each document.

In [None]:
import math
import re

def tokenize(text):
    # Lowercase and extract word tokens so that "movie," and "movie" count as the same word
    return re.findall(r"[a-z0-9']+", text.lower())

def calculate_tf(word, tokens):
    # Occurrences of the word divided by the total number of tokens in the document
    return tokens.count(word) / len(tokens)

def calculate_df(word, tokenized_corpus):
    # Number of documents containing the word as a whole token (not as a substring)
    return sum(1 for tokens in tokenized_corpus if word in tokens)

def calculate_idf(word, tokenized_corpus):
    total_docs = len(tokenized_corpus)
    doc_freq = calculate_df(word, tokenized_corpus)
    if doc_freq == 0:
        return 0.0  # Word absent from the corpus; avoids division by zero
    return math.log(total_docs / doc_freq)  # Matches the IDF formula above

def calculate_tfidf(word, tokens, tokenized_corpus):
    tf = calculate_tf(word, tokens)
    idf = calculate_idf(word, tokenized_corpus)
    return tf * idf

# Sample movie reviews
reviews = [
    "I loved the movie, it was amazing!",
    "The movie was terrible, I didn't like it.",
    "The plot was intriguing and well-developed.",
    "The acting was subpar and disappointing.",
]

# Tokenize each document once, then calculate TF-IDF for each word in each document
corpus = [tokenize(doc) for doc in reviews]
tfidf_matrix = []
for tokens in corpus:
    doc_tfidf = {}
    for word in tokens:
        doc_tfidf[word] = calculate_tfidf(word, tokens, corpus)
    tfidf_matrix.append(doc_tfidf)

# Print the TF-IDF representation for each document
for i, doc_tfidf in enumerate(tfidf_matrix):
    print(f"TF-IDF Representation for Document {i+1}:")
    for word, tfidf in doc_tfidf.items():
        print(f"{word}: {tfidf:.4f}")
    print()
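The per-document dictionaries can also be laid out as a dense matrix with one column per vocabulary word, mirroring the Bag of Words matrix from earlier. The following is a minimal sketch; the helper name `to_dense` is hypothetical, and the weights in `example_weights` are made-up illustrative values, not results computed from the reviews.

```python
def to_dense(weights_per_doc):
    """Turn a list of {word: weight} dicts into (vocabulary, row-per-document matrix)."""
    # Alphabetically sorted vocabulary over all documents
    vocabulary = sorted({word for doc in weights_per_doc for word in doc})
    # Words missing from a document get weight 0.0
    matrix = [[doc.get(word, 0.0) for word in vocabulary] for doc in weights_per_doc]
    return vocabulary, matrix

# Made-up weights, in the same list-of-dicts shape as the per-document results above
example_weights = [
    {"loved": 0.20, "movie": 0.10},
    {"movie": 0.09, "terrible": 0.17},
]

vocabulary, matrix = to_dense(example_weights)
print(vocabulary)  # ['loved', 'movie', 'terrible']
for row in matrix:
    print(row)
# [0.2, 0.1, 0.0]
# [0.0, 0.09, 0.17]
```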
