# Text mining as an example of biased predictive modeling

In this notebook, we will go through the basics of text mining and build a deliberately biased predictive model using machine-learning techniques.

For this exercise, we will try to classify hotel reviews into three categories: negative, neutral, and positive. Our dataset is segregated by gender: there is a female-only dataset and a male-only one. To illustrate how biases can be introduced into a model, we will train our model using the male-only dataset and then test it using female-only data. The discrepancy in overall accuracy between the two tests will show how a biased sample can compromise the performance and the fairness of the model.

## How to classify documents


* Load the data
* Split the original dataset into two subsets: training (90%) and test (10%)
* Generate the model using the training data
* Assess the performance of the model using the test dataset. Usually this is where the work ends if the results are satisfactory.
* For our experiment, we will test again with another tagged dataset and compare the results of both tests.

<hr>





![Load the data](images/get.png "Load the data") 
## <center>Load the data </center>

In [None]:
# Read the male-only dataset: column 0 holds the class label, column 1 the review text
import csv

datasetClass = []
datasetData = []

# The with block closes the file automatically, so no explicit close() is needed
with open('male-only-3C.csv', 'r', encoding='utf-8', newline='') as csvFile:
    reader = csv.reader(csvFile, delimiter=';')
    for row in reader:
        datasetClass.append(row[0].strip())
        datasetData.append(row[1].strip())

print("Result: ")
print(str(len(datasetClass)) + " comments have been imported")

![Split](images/pizza.png "Split the dataset") 
## <center>Split the original dataset into two subsets: training (90%) and test (10%) </center>

In [None]:
# Hold out every tenth row (indices 0, 10, 20, ...) as the test set;
# the remaining rows (about 90%) form the training set
trainingData = []
trainingClass = []

testData = []
testClass = []

for index in range(len(datasetClass)):
    if index % 10 == 0:
        testData.append(datasetData[index])
        testClass.append(datasetClass[index])
    else:
        trainingData.append(datasetData[index])
        trainingClass.append(datasetClass[index])

print("Result: ")
print("Dataset divided")
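
The split above is systematic rather than random: every tenth row goes to the test set. As a minimal sketch, the same idea can be wrapped in a reusable function. The name `split_every_nth` and the toy reviews are assumptions made for illustration and are not part of the original notebook.

```python
def split_every_nth(items, labels, n=10):
    """Send rows at index 0, n, 2n, ... to the test set and the rest to training."""
    train_x, train_y, test_x, test_y = [], [], [], []
    for index, (item, label) in enumerate(zip(items, labels)):
        if index % n == 0:
            test_x.append(item)
            test_y.append(label)
        else:
            train_x.append(item)
            train_y.append(label)
    return train_x, train_y, test_x, test_y


# Toy data, invented for illustration only
toy_data = [f"review {i}" for i in range(25)]
toy_labels = ["positive" if i % 2 else "negative" for i in range(25)]

train_x, train_y, test_x, test_y = split_every_nth(toy_data, toy_labels)
print(len(train_x), len(test_x))  # 22 3
```

Because the split depends only on row position, it is reproducible, but it assumes the rows are not ordered in a way that correlates with the class label.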

![Generate](images/generate.png "Generate") 
## <center>Generate the model using the training data</center>

In [None]:
# Train the classifier: word n-gram counts -> TF-IDF weights -> multinomial Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

text_clf = Pipeline([
    # Count unigrams, bigrams and trigrams, ignoring English stop words
    ('vect', CountVectorizer(stop_words='english', ngram_range=(1, 3))),
    # Reweight the raw counts by TF-IDF
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(trainingData, trainingClass)
print("Result: ")
print("Model Generated")
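
The vectorizer in the cell above is configured with `ngram_range=(1, 3)`, so each review is described by its single words, word pairs and word triples. The following plain-Python sketch shows roughly what such word n-grams look like. The helper `word_ngrams` is hypothetical, and it is a simplification: the real `CountVectorizer` also removes English stop words and uses its own tokenizer, so its actual features will differ.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.lower().split()  # naive whitespace tokenization (an assumption)
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("room was not clean"))
# ['room', 'was', 'not', 'clean', 'room was', 'was not', 'not clean',
#  'room was not', 'was not clean']
```

Longer n-grams let a model tell "not clean" apart from "clean", at the cost of a much larger vocabulary.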

![Test](images/report.png "Test") 
## <center>Assess the performance of the model using the test dataset</center>

In [None]:
#test the model
import numpy as np
from sklearn import metrics

predicted = text_clf.predict(testData)


print("--------------------------------------------------------------------------------")
print("CASE: MALE-ONLY TRAINING SET / MALE-ONLY TEST SET")
print("--------------------------------------------------------------------------------")
print("")

print("Accuracy: " + str(np.mean(predicted == testClass)))

print("")
print("")
print("CONFUSION MATRIX")
print("")
print(metrics.confusion_matrix(testClass, predicted))
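
To help read the output of the cell above: in scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, with labels in sorted order when none are given. The sketch below rebuilds such a matrix in plain Python on toy labels and derives accuracy and per-class recall from it. The helper `confusion_counts` and all labels here are invented for illustration and are not results from the notebook.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Build a confusion matrix: rows = true class, columns = predicted class."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    pairs = Counter(zip(true_labels, predicted_labels))
    matrix = [[pairs[(t, p)] for p in labels] for t in labels]
    return labels, matrix


# Toy labels, invented for illustration only
toy_true = ["negative", "negative", "neutral", "positive", "positive", "positive"]
toy_pred = ["negative", "neutral", "neutral", "positive", "positive", "negative"]

labels, matrix = confusion_counts(toy_true, toy_pred)

# Accuracy is the sum of the diagonal divided by the number of samples
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(toy_true)

# Recall per class: correct predictions for a class divided by its row total
recall = {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}

print(labels)   # ['negative', 'neutral', 'positive']
print(matrix)   # [[1, 1, 0], [0, 1, 0], [1, 0, 2]]
print(round(accuracy, 3))
print(recall)
```

Per-class recall is worth checking alongside overall accuracy, since a model can score well overall while doing poorly on one category.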

![Compare](images/compare.png "Compare") 
## <center>Test again with another tagged dataset (Female-Only) </center>

In [None]:
# Now let's try to predict female comments using our male-only trained model
compDatasetClass = []
compDatasetData = []

# The with block closes the file automatically, so no explicit close() is needed
with open('female-only-3C.csv', 'r', encoding='utf-8', newline='') as csvFile:
    reader = csv.reader(csvFile, delimiter=';')
    for row in reader:
        compDatasetClass.append(row[0].strip())
        compDatasetData.append(row[1].strip())

# Keep every tenth row, mirroring the way the male-only test set was selected
compTestData = []
compTestClass = []

for index in range(len(compDatasetData)):
    if index % 10 == 0:
        compTestData.append(compDatasetData[index])
        compTestClass.append(compDatasetClass[index])
    

compPredicted = text_clf.predict(compTestData)

print("--------------------------------------------------------------------------------")
print("CASE: MALE-ONLY TRAINING SET / FEMALE-ONLY TEST SET")
print("--------------------------------------------------------------------------------")

print("Accuracy: " + str(np.mean(compPredicted == compTestClass)))
print("")
print("")
print("CONFUSION MATRIX")
print("")
print(metrics.confusion_matrix(compTestClass, compPredicted))
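
The point of the experiment is the difference between the two accuracy figures printed above. As a minimal sketch, that comparison could be made explicit as follows. The function `accuracy` and every label below are toy values invented for illustration; they are not results from this notebook.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Toy labels, invented purely for illustration
same_group_true = ["positive", "negative", "neutral", "positive"]
same_group_pred = ["positive", "negative", "neutral", "negative"]
other_group_true = ["positive", "negative", "neutral", "positive"]
other_group_pred = ["negative", "negative", "positive", "negative"]

same_group_acc = accuracy(same_group_true, same_group_pred)
other_group_acc = accuracy(other_group_true, other_group_pred)
gap = same_group_acc - other_group_acc

print(f"Same-group accuracy:  {same_group_acc:.2f}")   # 0.75
print(f"Other-group accuracy: {other_group_acc:.2f}")  # 0.25
print(f"Accuracy gap:         {gap:.2f}")              # 0.50
```

In the notebook itself, the equivalent quantity would be `np.mean(predicted == testClass) - np.mean(compPredicted == compTestClass)`, computed after both test cells have run.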
