# Lab 1 - Naive Bayes Classifier

## Submission rules

1. Lab 1 is an assignment for teams of 2-3 students; the teams are listed on cms. Please make only one submission per team.
2. The assignment should be completed in a Google Colaboratory notebook (https://colab.research.google.com/notebooks/intro.ipynb#). To this end, first create a copy of this notebook in your personal Google Drive via "File" --> "Save a copy in Drive". Do not forget to
 *    rename the notebook and mention all your teammates in the name;      
 *    share your notebook within ucu.edu.ua domain, so that we will be able to open and grade it :)  
3. Submit the link to the final version of the notebook in the comments field of cms and list all the team members therein. No changes may be made to the notebook after the deadline.
4. At the top of your notebook, provide a work-breakdown structure estimating the effort of each team member.

Failure to comply with the submission rules may result in a deduction of up to 1 point.

## Introduction
During the past three weeks, you learned a couple of essential notions and theorems. One of them is Bayes' theorem.

One of its applications is the **Naive Bayes classifier**, a probabilistic classifier whose aim is to determine which class an observation most probably belongs to by using the Bayes formula:
$$\mathsf{P}(\mathrm{class}\mid \mathrm{observation})=\frac{\mathsf{P}(\mathrm{observation}\mid\mathrm{class})\mathsf{P}(\mathrm{class})}{\mathsf{P}(\mathrm{observation})}$$

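Under the strong independence assumption, one can calculate $\mathsf{P}(\mathrm{observation} \mid \mathrm{class})$ as
$$\mathsf{P}(\mathrm{observation} \mid \mathrm{class}) = \prod_{i=1}^{n} \mathsf{P}(\mathrm{feature}_i \mid \mathrm{class}),$$
where $n$ is the total number of features describing a given observation. If the features are also treated as independent in the denominator, so that $\mathsf{P}(\mathrm{observation}) = \prod_{i=1}^{n} \mathsf{P}(\mathrm{feature}_i)$, then $\mathsf{P}(\mathrm{class} \mid \mathrm{observation})$ can be calculated as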

$$\mathsf{P}(\mathrm{class} \mid \mathrm{observation}) = \mathsf{P}(\mathrm{class})\times \prod_{i=1}^{n}\frac{\mathsf{P}(\mathrm{feature}_i\mid \mathrm{class})}{\mathsf{P}(\mathrm{feature}_i)}$$
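
To make the formula concrete, here is a minimal numeric sketch. All probabilities, the two labels and the two-word message are made up for illustration; they are not taken from any of the lab data sets.

```python
from math import prod

# Hypothetical probabilities for a two-word message such as "free prize".
p_class = {"spam": 0.4, "ham": 0.6}            # P(class)
p_feature_given_class = {                      # P(feature_i | class)
    "spam": {"free": 0.30, "prize": 0.20},
    "ham": {"free": 0.05, "prize": 0.01},
}
p_feature = {"free": 0.15, "prize": 0.086}     # P(feature_i)


def posterior_score(label, features):
    """Evaluate P(class) * prod_i P(feature_i | class) / P(feature_i)."""
    ratios = (p_feature_given_class[label][f] / p_feature[f] for f in features)
    return p_class[label] * prod(ratios)


message = ["free", "prize"]
scores = {label: posterior_score(label, message) for label in p_class}
best = max(scores, key=scores.get)  # label with the highest score
print(scores, best)
```

The denominator is the same for every class, so it does not change which class wins. Since the independence assumption rarely holds exactly, the resulting scores need not sum to 1; only their ranking is used to pick a label.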

For a more detailed explanation, you can check [this link](https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/).



## Data description

There are 5 datasets uploaded on the cms. 

To determine your variant, take your team number from the list of teams on cms modulo 5; the result is the number of your data set.

* **0 - authors**
This data set consists of quotations from three famous writers: Edgar Allan Poe, Mary Wollstonecraft Shelley and H. P. Lovecraft. The task with this data set is to assign a piece of text to the author who was most likely to have written it.

* **1 - discrimination**
This data set consists of tweets that either contain discriminatory (sexist or racist) messages or are neutral. The task is to determine whether a given tweet is discriminatory or not.

* **2 - fake news**
This data set contains American news items, each given by a headline and an abstract of the article.
Each piece of news is classified as fake or credible. The task is to classify the news from test.csv as credible or fake.

* **3 - sentiment**
Each text message in this data set is labeled with one of three sentiments: positive, neutral or negative. The task is to classify a text message as positive, negative or neutral.

* **4 - spam**
This last data set contains SMS messages classified as spam or non-spam (ham in the data set). The task is to determine whether a given message is spam or non-spam.
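
The variant selection described above can be sketched as follows. The function name `variant_for_team` and the team number 7 are hypothetical; the mapping of numbers to data sets is the one listed above.

```python
# Data set names as listed above, indexed by variant number.
DATASETS = {
    0: "authors",
    1: "discrimination",
    2: "fake news",
    3: "sentiment",
    4: "spam",
}


def variant_for_team(team_number):
    """Return (variant number, data set name) for a given team number."""
    variant = team_number % 5
    return variant, DATASETS[variant]


print(variant_for_team(7))  # team 7 is a made-up example; prints (2, 'fake news')
```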

Each data set consists of two files: *train.csv* and *test.csv*. You will need the first one to find the probability distributions for each of the features, while the second one is needed for checking how well your classifier works.


## Implementation

In [34]:
import pandas as pd
import re

### Data pre-processing
* Read the *.csv* data files with the *pandas* package. This package will also provide you with a nice interface for data processing even within the classifier implementation.
* Clean your data of punctuation and other unneeded symbols.
* Clean your data of stop words. You don't want words such as *is*, *and*, *or*, etc. to affect your probability distributions, so it is a wise decision to get rid of them. You can find a list of stop words on cms under the lab task.
* Represent each text message as its bag-of-words. [Here](https://machinelearningmastery.com/gentle-introduction-bag-words-model/) you can find a general introduction to the bag-of-words model and examples of how to create it.
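
The cleaning and bag-of-words steps above can be sketched as follows. This is only one possible approach: the helper name `bag_of_words` is hypothetical, and `STOP_WORDS` is a tiny placeholder, not the stop-word list provided on cms.

```python
import re
from collections import Counter

# Placeholder stop-word list; the real one is provided on cms.
STOP_WORDS = {"is", "and", "or", "the", "a"}


def bag_of_words(message, stop_words=STOP_WORDS):
    """Strip non-letters, lower-case, drop stop words and count the rest."""
    cleaned = re.sub(r"[^a-zA-Z ]", "", message).lower()
    words = [word for word in cleaned.split() if word not in stop_words]
    return Counter(words)


print(bag_of_words("Spam, spam and eggs!"))
```

A `Counter` maps each remaining word to the number of times it occurs, which is the information a word-frequency based classifier needs from a message.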

In [37]:
def process_data(data_file):
    """
    Function for data processing and splitting it into X and y sets.
    :param data_file: str - path to the .csv data file
    :return: pd.DataFrame|list, pd.DataFrame|list - X and y data frames or lists
    """
    df = pd.read_csv(data_file)
    # Keep only Latin letters and spaces, then lower-case the text
    regex = re.compile('[^a-zA-Z ]')
    df['text'] = df['text'].apply(lambda x: regex.sub('', x).lower())
    # NOTE: only the cleaned messages (X) are returned for now; the labels (y)
    # still have to be returned as well, as described in the docstring
    return df['text']
    
print(process_data('./data/0-authors/test.csv'))


0       she became excessively intimate with most of t...
1       as he walked among other men he seemed encompa...
2       i beg both your pardons but i cant be so much ...
3       now he was no less a fanatic but his desire to...
4       i see by your eagerness and the wonder and hop...
                              ...                        
5869    there was a simple natural earnestness about h...
5870    as if by some sudden convulsive exertion reaso...
5871    we passed over innumerable vessels of all kind...
5872    at the college we used an incinerator but the ...
5873    outside the moon played on the ridgepole of th...
Name: text, Length: 5874, dtype: object


*   If you need to implement some additional methods, feel free to do it.

### Implementation
Implement each method of the `BayesianClassifier` class according to its description.

In [None]:
class BayesianClassifier:
    """
    Implementation of Naive Bayes classification algorithm.
    """
    def __init__(self):
        pass

    def fit(self, X, y):
        """
        Fit Naive Bayes parameters according to train data X and y.
        :param X: pd.DataFrame|list - train input/messages
        :param y: pd.DataFrame|list - train output/labels
        :return: None
        """
        pass

    def predict_prob(self, message, label):
        """
        Calculate the probability that a given label can be assigned to a given message.
        :param message: str - input message
        :param label: str - label
        :return: float - probability P(label|message)
        """
        pass

    def predict(self, message):
        """
        Predict label for a given message.
        :param message: str - message
        :return: str - label that is most likely to be truly assigned to a given message
        """
        pass

    def score(self, X, y):
        """
        Return the mean accuracy on the given test data and labels - the efficiency of a trained model.
        :param X: pd.DataFrame|list - test data - messages
        :param y: pd.DataFrame|list - test labels
        :return: float - mean accuracy on the test data
        """
        pass

### Testing
*  Finally, after you are done with your classifier, test it.

In [None]:
train_X, train_y = process_data("your train data file")
test_X, test_y = process_data("your test data file")

classifier = BayesianClassifier()
classifier.fit(train_X, train_y)
classifier.predict_prob(test_X[0], test_y[0])

print("model score: ", classifier.score(test_X, test_y))
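
The mean accuracy that `score` is expected to return can be illustrated on its own. This is a minimal sketch: the helper `mean_accuracy` and the labels below are made up, and how you obtain the predicted labels is up to your classifier.

```python
def mean_accuracy(predicted_labels, true_labels):
    """Fraction of positions where the predicted label equals the true label."""
    pairs = list(zip(predicted_labels, true_labels))
    if not pairs:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(predicted == true for predicted, true in pairs)
    return correct / len(pairs)


# Made-up labels, for illustration only.
predicted = ["spam", "ham", "ham", "spam"]
actual = ["spam", "ham", "spam", "spam"]
print(mean_accuracy(predicted, actual))  # 0.75
```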

## Conclusions

Summarize your work by explaining in a few sentences the points listed below.




* ### Describe the method implemented in general:


* ### List pros and cons of the method:

* ### Add a few sentences about your implementation of the classifier:


* ### Describe your results:
