# Lab 1 - Naive Bayes Classifier

## Submission rules

1. Lab 1 is an assignment for teams of 2-3 students; the teams are listed on cms. Please make only one submission per team.  
2. The assignment should be completed in a Google Colaboratory notebook (https://colab.research.google.com/notebooks/intro.ipynb#). To this end, first create a copy of this notebook in your personal Google Drive via "File" --> "Save a copy in Drive". Do not forget to    
 *    rename the notebook and mention all your teammates in the name;      
 *    share your notebook within the ucu.edu.ua domain, so that we will be able to open and grade it :)  
3. Submit the link to the final version of the notebook in the comments field of cms and list all the team members therein. No changes may be made to the notebook after the deadline.
4. At the top of your notebook, provide a work-breakdown structure estimating the effort of each team member.

Failure to comply with the submission rules may result in a deduction of up to 1 point.

## Introduction
During the past three weeks, you learned a couple of essential notions and theorems. One of them is Bayes' theorem.

One of its applications is the **Naive Bayes classifier**, a probabilistic classifier that determines which class an observation most probably belongs to by using the Bayes formula:
$$\mathsf{P}(\mathrm{class}\mid \mathrm{observation})=\frac{\mathsf{P}(\mathrm{observation}\mid\mathrm{class})\mathsf{P}(\mathrm{class})}{\mathsf{P}(\mathrm{observation})}$$

Under the strong independence assumption, one can calculate $\mathsf{P}(\mathrm{observation} \mid \mathrm{class})$ as
$$\mathsf{P}(\mathrm{observation} \mid \mathrm{class}) = \prod_{i=1}^{n} \mathsf{P}(\mathrm{feature}_i \mid \mathrm{class}),$$
where $n$ is the total number of features describing a given observation. Treating $\mathsf{P}(\mathrm{observation})$ in the same way, as $\prod_{i=1}^{n} \mathsf{P}(\mathrm{feature}_i)$, the probability $\mathsf{P}(\mathrm{class} \mid \mathrm{observation})$ can now be calculated as

$$\mathsf{P}(\mathrm{class} \mid \mathrm{observation}) = \mathsf{P}(\mathrm{class})\times \prod_{i=1}^{n}\frac{\mathsf{P}(\mathrm{feature}_i\mid \mathrm{class})}{\mathsf{P}(\mathrm{feature}_i)}$$
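
The last formula can be evaluated directly once the individual probabilities are known. Below is a minimal sketch; the function name `naive_bayes_score` and all the numbers are hypothetical and serve only to illustrate the arithmetic.

```python
from math import prod

def naive_bayes_score(prior, cond_probs, feature_probs):
    """Evaluate P(class) * prod_i P(feature_i | class) / P(feature_i)."""
    return prior * prod(c / f for c, f in zip(cond_probs, feature_probs))

# Hypothetical values for one class and two features:
# P(class) = 0.5, P(feature_i | class) = 0.2 and 0.4, P(feature_i) = 0.1 and 0.5
score = naive_bayes_score(prior=0.5, cond_probs=[0.2, 0.4], feature_probs=[0.1, 0.5])
print(score)  # 0.5 * (0.2 / 0.1) * (0.4 / 0.5)
```

Computing this score for every class and picking the largest one gives the predicted class.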

For a more detailed explanation, you can check [this link](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/).



## Data description

There are 5 datasets uploaded on the cms. 

To determine your variant, take your team number from the list of teams on cms and reduce it *mod 5*; the result is the number of your data set.

* **0 - authors**
This data set consists of quotations from three famous writers: Edgar Allan Poe, Mary Wollstonecraft Shelley and H. P. Lovecraft. The task is to attribute a piece of text to the author who most likely wrote it.

* **1 - discrimination**
This data set consists of tweets that contain discriminatory (sexist or racist) messages and of tweets that are neutral. The task is to determine whether a given tweet is discriminatory or not.

* **2 - fake news**
This data set contains American news items, each with a headline and an abstract of the article.
Each piece of news is classified as fake or credible. The task is to classify the news from test.csv as credible or fake.

* **3 - sentiment**
Each text message in this data set is labeled with one of three sentiments: positive, neutral or negative. The task is to classify a given text message as positive, negative or neutral.

* **4 - spam**
This last data set contains SMS messages classified as spam or non-spam (ham in the data set). The task is to determine whether a given message is spam or non-spam.
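
The variant rule above can be sketched as a one-line lookup. The list `DATASETS`, the helper `dataset_for_team` and the team number 13 are hypothetical names and values used only for illustration.

```python
# Data set names in the order of their variant numbers 0..4
DATASETS = ["authors", "discrimination", "fake news", "sentiment", "spam"]

def dataset_for_team(team_number: int) -> str:
    """Return the data set assigned to a team: team number mod 5."""
    return DATASETS[team_number % 5]

variant = dataset_for_team(13)  # 13 mod 5 = 3
print(variant)
```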

Each data set consists of two files: *train.csv* and *test.csv*. You will need the first one to find the probability distributions for each of the features, while the second one is needed to check how well your classifier works.


## Implementation

In [5]:
import pandas as pd
from typing import List
from collections import Counter
from string import punctuation
from nltk.stem import PorterStemmer

ps = PorterStemmer()
df = pd.read_csv("train.csv")
df

Unnamed: 0,sentiment,text
0,neutral,"According to Gran , the company has no plans t..."
1,neutral,Technopolis plans to develop in stages an area...
2,negative,The international electronic industry company ...
3,positive,With the new production plant the company woul...
4,positive,According to the company 's updated strategy f...
...,...,...
3963,neutral,Under the agreement Benefon 's forthcoming ran...
3964,neutral,"Under the terms of the agreement , Bunge will ..."
3965,neutral,"Under the transaction agreement , Metsaliitto ..."
3966,neutral,Underground parking facilities will also be bu...


### Data pre-processing
* Read the *.csv* data files with the *pandas* package. This package will also provide you with a nice interface for data processing even within the classifier implementation.
* Remove punctuation and other unneeded symbols from your data.
* Remove stop words from your data. You don't want words such as *is*, *and*, *or* etc. to affect your probability distributions, so it is a wise decision to get rid of them. Find the list of stop words in the cms under the lab task.
* Represent each text message as its bag-of-words. [Here](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) you can find a general introduction to the bag-of-words model and examples of how to create it.
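
The steps above can be tried on a single sentence before touching the real files. The following is a minimal sketch that uses only the standard library; the tiny `STOP_WORDS` set and the helper `bag_of_words` are assumptions made for illustration (the real stop-word list comes from cms), and no stemming is applied here.

```python
from collections import Counter
from string import punctuation

# Hypothetical stop words; the real list is provided with the lab task
STOP_WORDS = {"is", "and", "or", "the", "a", "to"}

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, drop punctuation and stop words, count the rest."""
    cleaned = text.lower().translate(str.maketrans("", "", punctuation))
    return Counter(word for word in cleaned.split() if word not in STOP_WORDS)

bag = bag_of_words("The plan is good, and the plan is to grow!")
print(bag)
```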

In [2]:
# your code here
def get_stop_words() -> set:
    """
    Returns set of words to ignore
    """
    words_file_path = "stop_words.txt"
    stop_words = set()
    with open(words_file_path) as f:
        for line in f:
            stop_words.add(line.strip())
    return stop_words

def prepare_sentence(sentence: str) -> Counter:
    """
    Converts a sentence into its bag-of-words: a Counter of word stems
    with punctuation and stop words removed
    """
    sentence = str(sentence).lower()

    # Strip every punctuation character
    for char in punctuation:
        sentence = sentence.replace(char, "")

    words = sentence.split()
    stop_words = get_stop_words()
    counter = Counter()
    for word in words:
        if word not in stop_words:
            stem = ps.stem(word)
            counter[stem] += 1

    return counter
            


In [7]:
def process_data(data_file):
    """
    Process the data and split it into X and y sets.
    :param data_file: str - path to the data file
    :return: list, list - X (bag-of-words Counters) and y (sentiment labels)
    """
    
    df = pd.read_csv(data_file)
    df["text"] = df["text"].apply(prepare_sentence)
    
    
    return list(df["text"].values), list(df["sentiment"].values)
    
result = process_data("train.csv")
result[0]

[Counter({'accord': 1,
          'gran': 1,
          'compani': 2,
          'plan': 1,
          'move': 1,
          'product': 1,
          'russia': 1,
          'although': 1,
          'grow': 1}),
 Counter({'technopoli': 1,
          'plan': 1,
          'develop': 1,
          'stage': 1,
          'area': 1,
          'less': 1,
          '100000': 1,
          'squar': 1,
          'meter': 1,
          'order': 1,
          'host': 1,
          'compani': 1,
          'work': 1,
          'comput': 1,
          'technolog': 1,
          'telecommun': 1,
          'statement': 1,
          'said': 1}),
 Counter({'intern': 1,
          'electron': 1,
          'industri': 1,
          'compani': 2,
          'elcoteq': 1,
          'laid': 1,
          'ten': 1,
          'employe': 1,
          'tallinn': 1,
          'facil': 1,
          'contrari': 1,
          'earlier': 1,
          'layoff': 1,
          'contract': 1,
          'rank': 1,
          'offic': 1,
       

In [25]:
sentiments = ["positive", "neutral", "negative"]

# For every stem, count the training messages of each sentiment that contain it
all_words_count = dict()
for counter in result[0]:
    for element in counter:
        all_words_count[element] = {sentiment: 0 for sentiment in sentiments}

for counter, sentiment in zip(result[0], result[1]):
    for word in counter:
        all_words_count[word][sentiment] += 1

# Normalise the counts per word: the share of each sentiment
# among the messages that contain the word
all_words_prob = dict()
for word, counts in all_words_count.items():
    total = sum(counts.values())
    all_words_prob[word] = {sentiment: counts[sentiment] / total for sentiment in sentiments}
all_words_prob


{'accord': {'positive': 0.26548672566371684,
  'neutral': 0.7079646017699115,
  'negative': 0.02654867256637168},
 'gran': {'positive': 0.5, 'neutral': 0.5, 'negative': 0.0},
 'compani': {'positive': 0.3206806282722513,
  'neutral': 0.6492146596858639,
  'negative': 0.030104712041884817},
 'plan': {'positive': 0.3181818181818182,
  'neutral': 0.6818181818181818,
  'negative': 0.0},
 'move': {'positive': 0.38461538461538464,
  'neutral': 0.6153846153846154,
  'negative': 0.0},
 'product': {'positive': 0.2979591836734694,
  'neutral': 0.6857142857142857,
  'negative': 0.0163265306122449},
 'russia': {'positive': 0.3684210526315789,
  'neutral': 0.5921052631578947,
  'negative': 0.039473684210526314},
 'although': {'positive': 0.5, 'neutral': 0.5, 'negative': 0.0},
 'grow': {'positive': 0.5, 'neutral': 0.5, 'negative': 0.0},
 'technopoli': {'positive': 0.125, 'neutral': 0.875, 'negative': 0.0},
 'develop': {'positive': 0.25806451612903225,
  'neutral': 0.7311827956989247,
  'negative': 0.

*   If you need to implement some additional methods, feel free to do it.

31634  - symbol values [".", ",", "(", ")", ":", "&","-", "\"","[","]", "?"]

30458 - replacing all punctuation

23106 - using stemmer

### Implementation
Implement each method of the `BayesianClassifier` class according to its description.

In [None]:
class BayesianClassifier:
    """
    Implementation of Naive Bayes classification algorithm.
    """
    def __init__(self):
        self.model: pd.DataFrame = None
        self.lbl_properties: pd.DataFrame = None

    def fit(self, X, y):
        """
        Fit Naive Bayes parameters according to train data X and y.
        :param X: pd.DataFrame|list - train input/messages
        :param y: pd.DataFrame|list - train output/labels
        :return: None
        """
        # Placeholder tables with hard-coded values: X and y are not used yet
        record = {'word': ['difficult', 'exercise', 'play', 'football'],
                   'positive': [0.09, 0.3, 0.12, 0.9],
                   'neutral': [0.2, 0.5, 0.01, 0.3],
                   'negative': [0.2, 0.5, 0.01, 0.15],
                   'total': [0.25, 0.25, 0.25, 0.25]}
        
        lbl_properties = {'label': ['positive', 'neutral', 'negative'],
                          'prob': [0.2, 0.3, 0.7]}
        
        self.model = pd.DataFrame(record, columns = ['word', 'positive', 'neutral', 'negative', 'total'])
        self.lbl_properties = pd.DataFrame(lbl_properties, columns = ['label', 'words_count', 'prob'])

    def predict_prob(self, message, label):
        """
        Calculate the probability that a given label can be assigned to a given message.
        :param message: Counter - bag-of-words of the input message
        :param label: str - label
        :return: float - probability P(label|message)
        """
        model = self.model
        lbl_props = self.lbl_properties
        
        multiplication = 1
        for word, count in message.items():
            word_info = model.loc[model['word'] == word]
            if len(word_info) != 0:
                word_info = word_info.iloc[0]
                word_cond = word_info[label]
                word_tot = word_info['total']
            else:
                # a word missing from the model gets ratio 1,
                # so it does not affect the comparison between labels
                word_cond = 1
                word_tot = 1
#                 word_cond = (0 + 1) / (lbl_props[lbl_props['label'] == label].iloc[0]['words_count'] + lbl_props[lbl_props['label'] == 'total'].iloc[0]['words_count'])
#                 word_tot = 1 / lbl_props[lbl_props['label'] == 'total'].iloc[0]['words_count']
            multiplication *= (word_cond / word_tot) ** count
        
        prob = lbl_props[lbl_props['label'] == label].iloc[0]['prob'] * multiplication
        
        return prob
        

    def predict(self, message):
        """
        Predict label for a given message.
        :param message: Counter - bag-of-words of the message
        :return: str - label that is most likely to be truly assigned to a given message
        """
        lbl_props = self.lbl_properties
        lbls = lbl_props[lbl_props['label'] != 'total']['label']
        
        probs = {}
        for lbl in lbls:
            probs[lbl] = self.predict_prob(message, lbl)
        probs = list(probs.items())
        probs.sort(key=lambda x: x[1], reverse=True)
        
        return probs[0][0]

    def score(self, X, y):
        """
        Return the mean accuracy on the given test data and labels - the efficiency of a trained model.
        :param X: pd.Series|list - test data - messages as bag-of-words Counters
        :param y: pd.Series|list - test labels
        :return: float - share of messages whose predicted label matches the real one
        """
        cor_mes = 0
        all_mes = len(X)
        # zip pairs messages with labels for both lists and pd.Series
        for message, real in zip(X, y):
            pred = self.predict(message)
            if pred == real:
                cor_mes += 1

        return cor_mes / all_mes

# test
    
classifier = BayesianClassifier()
classifier.fit(None, None)
print(classifier.predict(Counter({'difficult': 1, 'football': 1, 'chicccken': 1, 'Alps': 3})))
print(classifier.predict(Counter({'play': 1, 'chicccken': 1, 'Alps': 3})))
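
The `fit` method above fills its tables with hard-coded values. As a rough sketch of how the same quantities could be estimated from training data, the function below computes P(word | label), P(word) and P(label) from bag-of-words Counters. The name `estimate_tables`, the toy data and the choice of relative word frequencies as estimates are all assumptions, not the required solution.

```python
from collections import Counter

def estimate_tables(X, y):
    """Estimate word and label probabilities from Counters X and labels y."""
    labels = sorted(set(y))

    # Word counts accumulated separately for every label
    per_label = {label: Counter() for label in labels}
    for bag, label in zip(X, y):
        per_label[label].update(bag)
    label_sizes = {label: sum(per_label[label].values()) for label in labels}

    # Word counts over the whole training set
    overall = Counter()
    for counts in per_label.values():
        overall.update(counts)
    total_words = sum(overall.values())

    word_rows = []
    for word, count in overall.items():
        row = {"word": word}
        for label in labels:
            # P(word | label): share of the label's words equal to this word
            row[label] = per_label[label][word] / label_sizes[label]
        row["total"] = count / total_words  # P(word)
        word_rows.append(row)

    label_rows = [
        {"label": label,
         "words_count": label_sizes[label],
         "prob": list(y).count(label) / len(y)}  # P(label)
        for label in labels
    ]
    return word_rows, label_rows

# Toy training data (hypothetical)
X = [Counter({"good": 2, "profit": 1}), Counter({"loss": 1, "profit": 1}), Counter({"good": 1})]
y = ["positive", "negative", "positive"]
word_rows, label_rows = estimate_tables(X, y)
for row in word_rows:
    print(row)
```

Each dictionary in `word_rows` mirrors the columns of `self.model` (the word, one column per label, and `total`), so `pd.DataFrame(word_rows)` would give a table of that shape. Note that a word never seen with some label gets probability 0 for it, which zeroes the whole product for that label.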

### Testing
*  Finally, after you are done with your classifier, test it.

In [None]:
train_X, train_y = None, None
# test_X, test_y = process_data("your test data file")

classifier = BayesianClassifier()
classifier.fit(train_X, train_y)
# classifier.predict_prob(Counter({'play': 1, 'football': 1, 'chicccken': 1}), 'positive')
classifier.predict(Counter({'difficult': 1, 'football': 1, 'chicccken': 1, 'Alps': 3}))

# print("model score: ", classifier.score(test_X, test_y))

## Conclusions

Summarize your work by explaining in a few sentences the points listed below.




* ### Describe the method implemented in general:


* ### List pros and cons of the method:

* ### Add a few sentences about your implementation of the classifier:


* ### Describe your results:
