# I - Naive Bayes

## 1) With the help of pandas tools, identify the volume and the dimension of the data.


In [23]:
import pandas as pd
import numpy as np

In [7]:
# import csv file as a dataframe object
data = pd.read_csv("reviews_by_course.csv")

# first approach
nb_samples, nb_fields = data.shape
print(f"{nb_samples} records of {nb_fields} fields")

# print a short amount of data
data.head()

# describe dataset with basic information (number of rows, mean, std ...)
# data.describe()

# count rows per column name
# data.count()

# count rows per column name (even none values)
# data.fillna("", inplace=True)
# data.count()

140320 records of 3 fields


| | CourseId | Review | Label |
|---|---|---|---|
| 0 | 2-speed-it | BOring | 1 |
| 1 | 2-speed-it | Bravo ! | 5 |
| 2 | 2-speed-it | Very goo | 5 |
| 3 | 2-speed-it | Great course - I recommend it for all, especia... | 5 |
| 4 | 2-speed-it | One of the most useful course on IT Management! | 5 |


As we can see, the dataframe has 140320 rows and 3 columns (`CourseId`, `Review` and `Label`).


## 2) Target classes are the ratings, from 1 to 5. Determine the number of instances for each class. What could be the problem during the learning process?

In [8]:
# first, naive approach: filter each class into a list of dataframes and print its length
print("first approach : ")
classes = []
for i in range(1,6):
    mask = data["Label"] == i  # boolean mask; avoids shadowing the built-in filter()
    classes.append(data[mask])
    print("classes",i,"has",len(data[mask]), "rows")
    
# second approach with pandas group by
print("\nsecond approach : ")
# number of rows per label
classes2 = data.groupby(["Label"]).size()
# non-null count of each column, per label
# classes2 = data.groupby(["Label"]).count()
classes2

first approach : 
classes 1 has 2867 rows
classes 2 has 2554 rows
classes 3 has 5923 rows
classes 4 has 22460 rows
classes 5 has 106516 rows

second approach : 


Label
1      2867
2      2554
3      5923
4     22460
5    106516
dtype: int64

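As we can see, the 5th class dominates the dataset: it holds 106516 of the 140320 rows (about 76%), while the four other classes combined contain less than a third as many rows as class 5 alone.

This imbalance can be a problem during learning. The model sees plenty of examples of 5-star reviews and will tend to predict that rating well, but it has comparatively little data for the other classes and is likely to handle them poorly. Overall accuracy can also look good simply because predicting 5 is right most of the time.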
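To quantify the imbalance, the share of each class can be computed from the counts. The sketch below rebuilds the counts printed above as a Series (`label_counts` is an illustrative name, not a variable from the notebook) instead of reloading the CSV; on the real dataframe, `data["Label"].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied from the output above (illustrative Series, not the real dataframe)
label_counts = pd.Series(
    [2867, 2554, 5923, 22460, 106516],
    index=pd.Index([1, 2, 3, 4, 5], name="Label"),
)

# Share of each class in the whole dataset
shares = label_counts / label_counts.sum()
print(shares.round(3))

# Accuracy of a trivial model that always predicts the majority class
majority_baseline = shares.max()
print(f"always predicting {shares.idxmax()} would score {majority_baseline:.2f} accuracy")
```

A baseline of this kind is worth keeping in mind when reading accuracy figures obtained on unbalanced data.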

## 3) Each instance consists of a short paragraph of text, containing the review, and the course tag.

and

## 4) The volume of the data could be quite (too) large, depending on your computer. Together with your answer to Question 2), and using the function head(), generate a meaningful subset of the data.

In [9]:
# Transform sentences into count vectors using the bag-of-words method
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data.head()["Review"]).toarray()
print(X)
# fit_transform refits the vectorizer, so the review vocabulary is replaced here
Y = vectorizer.fit_transform(data.head()["CourseId"]).toarray()
print(Y)

[[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1]
 [1 1 0 0 1 1 1 1 0 1 2 0 1 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0 1 1 0]]
[[1 1]
 [1 1]
 [1 1]
 [1 1]
 [1 1]]
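The second matrix contains only ones because the five rows share the same course tag. A minimal sketch of what `CountVectorizer` does with that tag, assuming its default settings (lowercasing, and a token pattern that keeps only words of two or more characters):

```python
from sklearn.feature_extraction.text import CountVectorizer

# The five first rows all carry the same course tag
course_ids = ["2-speed-it"] * 5

vec = CountVectorizer()
counts = vec.fit_transform(course_ids)

# Tokens kept by the default tokenizer: the single character "2" is dropped
vocab = sorted(vec.vocabulary_)
print(vocab)
print(counts.toarray())
```

Each column of the matrix corresponds to one token of the vocabulary, so a course tag is encoded by the words it is made of rather than as a single category.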


Here we only tried with the first five rows.

Given the number of rows and of distinct words across all the reviews, the vocabulary for the full dataset would be very large, and converting the count matrix to a dense array with `toarray()` would probably run out of memory.

Thus, we choose to keep only the first 200 rows of each class, which both reduces the size of the data and keeps the 5th class from being over-represented.
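Taking the first rows of every class can also be written with a group-by. The sketch below uses a small made-up dataframe (`toy` and its contents are hypothetical, only the column names match the dataset) and 2 rows per class instead of 200.

```python
import pandas as pd

# Toy stand-in for the reviews dataframe (hypothetical data, same column names)
toy = pd.DataFrame({
    "Review": [f"review {i}" for i in range(12)],
    "Label":  [5, 5, 5, 5, 5, 5, 4, 4, 4, 1, 1, 3],
})

n_per_class = 2  # the notebook uses 200

# head(n) on a groupby keeps at most the first n rows of every group
balanced = toy.groupby("Label").head(n_per_class)
class_sizes = balanced["Label"].value_counts().sort_index()
print(class_sizes)
```

Note that `head` takes rows in file order, so if the file is sorted by course the subset may cover only a few courses. `groupby("Label").sample(n=..., random_state=...)` is one possible way to draw the rows at random instead.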

In [10]:
# vectorize the two text columns (the vectorizer is refit on each call)
review_vectors = vectorizer.fit_transform(pd.concat([i.head(200)["Review"] for i in classes])).toarray()
courseid_vectors = vectorizer.fit_transform(pd.concat([i.head(200)["CourseId"] for i in classes])).toarray()
Y = list(pd.concat([i.head(200)["Label"] for i in classes]))

# creating a new dataframe with all the data in it
df = pd.DataFrame()
df.insert(value=pd.Series(list(review_vectors), index=range(0,len(review_vectors))), column="CountVectorReview", loc=0)
df.insert(value=pd.Series(list(courseid_vectors), index=range(0,len(courseid_vectors))), column="CountVectorCourseId", loc=1)
df.insert(value=pd.Series(Y, index=range(0,len(Y))), column="Label", loc=2)
df[199:201]

| | CountVectorReview | CountVectorCourseId | Label |
|---|---|---|---|
| 199 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 1 |
| 200 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | 2 |


## 5) Using a naive Bayes approach, we are going to learn a model whose objective is to predict the rating, given the tag and the review. What kind of probabilistic distribution should we use?

We choose a multinomial distribution because each feature vector is a histogram: entry i counts how many times event i (word i appears) was observed in a particular review. The multinomial distribution is particularly well suited to modelling the probability of such a combination of counts (here, the presence of each word n times) for each label.

- [see Multinomial naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Bernoulli_naive_Bayes)
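To make the multinomial model concrete, here is a minimal NumPy sketch of how such a classifier scores a count vector: a class prior plus, for every word, the word count times the log of a smoothed per-class word probability. The tiny matrix `X`, the labels `y` and the helper `predict` are made up for illustration, and the smoothing constant `alpha = 1.0` is an assumption chosen to mirror Laplace smoothing.

```python
import numpy as np

# Toy word-count matrix: 4 documents, 3 vocabulary words (hypothetical data)
X = np.array([
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 2],
])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing (assumed value)

classes = np.unique(y)
# log P(c): relative class frequencies
log_prior = np.log(np.array([(y == c).mean() for c in classes]))
# Smoothed per-class word probabilities p_ci
word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))

def predict(x):
    # log P(c | x) up to a constant: log P(c) + sum_i x_i * log p_ci
    scores = log_prior + log_likelihood @ np.asarray(x)
    return classes[np.argmax(scores)]

print(predict([1, 0, 0]))  # a document containing only word 0
print(predict([0, 2, 0]))  # a document containing word 1 twice
```

The word-order independence of the score is what makes the approach "naive": only the counts matter, not where the words appear in the review.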

In [25]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

In [15]:
# Instantiate the classifier
mnb_classifier = MultinomialNB()
# Concatenate the review and course-id count vectors of each row
# zip pairs the rows of the two columns (and stops at the shorter one if their lengths differ)
X = [list(v1) + list(v2) for v1, v2 in zip(df["CountVectorReview"], df["CountVectorCourseId"])]
Y = df["Label"]

# We split our dataset in two parts: a training set used to learn the model
# and a test set used to evaluate it.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
# We learn the model with the training set
mnb_classifier.fit(X_train, Y_train)
# We recover the predictions for the test set
prediction = mnb_classifier.predict(X_test)
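The cell above turns every count vector into a dense Python list before fitting. As an alternative sketch (not what the notebook does), the two count matrices can stay sparse and be stacked side by side, which is one way to limit memory use on larger subsets. The mini-corpus, the course id `python-basics`, and the names `review_vec` and `course_vec` are hypothetical; `stratify` and `random_state` are added here only to make the toy split balanced and repeatable.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus standing in for the sampled reviews
reviews = ["boring course", "bad and boring", "great course", "very good course",
           "boring and bad", "great and good", "bad course", "very great"]
course_ids = ["2-speed-it"] * 4 + ["python-basics"] * 4
labels = [1, 1, 5, 5, 1, 5, 1, 5]

# One vectorizer per column, so that both vocabularies are kept
review_vec = CountVectorizer()
course_vec = CountVectorizer()
X = hstack([review_vec.fit_transform(reviews),
            course_vec.fit_transform(course_ids)]).tocsr()

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0)

# MultinomialNB accepts sparse input directly
model = MultinomialNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
print(X.shape, y_pred)
```

Keeping two fitted vectorizers also means new reviews can later be encoded with `transform` using the same vocabularies.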

In [19]:
res = accuracy_score(Y_test, prediction)
print(f"Accuracy: {res:.2f}")

Accuracy: 0.55


### Analysis of the results :

In classification, the following indicators help interpret the performance of our model:

![Precisionrecall.png](attachment:Precisionrecall.png)

* Precision : the true positives (the items correctly labeled as belonging to the positive class) among the true positives and false positives (the items labeled as belonging to the positive class that do not actually belong to it). Precision tells whether the returned results are more relevant than irrelevant.


* Recall : the true positives among the true positives and false negatives (the items labeled as not belonging to the class that actually do belong to it). Recall tells whether most of the relevant results were found.


* F-measure : the harmonic mean of the last two indicators, i.e. `2 * precision * recall / (precision + recall)`.


* Accuracy : the fraction of all samples, over all classes, that are classified correctly.

In [28]:
prec, rec, f1, support = precision_recall_fscore_support(Y_test, prediction)
# labels run from 1 to 5, while the metric arrays are indexed from 0
for k in range(1, len(prec) + 1):
    print(f"class {k} precision: {prec[k-1]:.2f} recall: {rec[k-1]:.2f} f_measure: {f1[k-1]:.2f}")
# means and standard deviations over the five classes (macro averages)
print(f"allclass precision: {np.mean(prec):.2f} +/-({np.std(prec):.2f})\nallclass recall: {np.mean(rec):.2f} +/-({np.std(rec):.2f})\nf_measure: {np.mean(f1):.2f} +/-({np.std(f1):.2f})")
print("confusion matrix:\n", confusion_matrix(Y_test, prediction))

class 1 precision: 0.39 recall: 0.38 f_measure: 0.38
class 2 precision: 0.38 recall: 0.37 f_measure: 0.38
class 3 precision: 0.32 recall: 0.34 f_measure: 0.33
class 4 precision: 0.79 recall: 0.62 f_measure: 0.70
class 5 precision: 0.78 recall: 0.97 f_measure: 0.87
allclass precision: 0.53 +/-(0.21)
allclass recall: 0.54 +/-(0.24)
f_measure: 0.53 +/-(0.21)
confusion matrix:
 [[12 11  8  0  1]
 [14 16 10  2  1]
 [ 3 10 12  6  4]
 [ 2  4  8 31  5]
 [ 0  1  0  0 39]]
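The per-class figures can be recomputed by hand from the confusion matrix, which makes the link between the matrix and the indicators explicit. This sketch assumes the scikit-learn convention (rows are true classes, columns are predicted classes) and simply copies the matrix printed above.

```python
import numpy as np

# Confusion matrix copied from the output above
# (scikit-learn convention: rows = true class, columns = predicted class)
cm = np.array([
    [12, 11,  8,  0,  1],
    [14, 16, 10,  2,  1],
    [ 3, 10, 12,  6,  4],
    [ 2,  4,  8, 31,  5],
    [ 0,  1,  0,  0, 39],
])

true_pos = np.diag(cm)
precision = true_pos / cm.sum(axis=0)   # TP / (TP + FP): column totals
recall = true_pos / cm.sum(axis=1)      # TP / (TP + FN): row totals
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()    # correct predictions / all predictions

for k, (p, r, f) in enumerate(zip(precision, recall, f1), start=1):
    print(f"class {k} precision: {p:.2f} recall: {r:.2f} f_measure: {f:.2f}")
print(f"accuracy: {accuracy:.2f}")
```

Reading the matrix row by row also shows where the errors go: classes 1, 2 and 3 are mostly confused with each other, while classes 4 and 5 are rarely mistaken for the low ratings.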
