
#### Import Python Libraries and Define Global Variables

In [1]:

#Import Pandas and Numpy Python Libraries
import pandas as pd        #data analysis and manipulation library for Python
import numpy as np         #mathematical operations over arrays

#Global variables
train_data = []            #sentences for training (filled by prepare_file)
train_target = []          #numeric labels for the training sentences
test_data = []             #sentences for testing (filled by prepare_file)
test_target = []           #numeric labels for the test sentences

#### Create data frames

In [2]:
#Import files into a dataframe
def import_text(file_txt):
    
    colnames=['ID', 'Text'] 
    
    df = pd.read_csv(file_txt
                       ,skip_blank_lines=True   #input files have empty lines
                       ,header=None             #no headers
                       ,sep='\t'                #tab delimited
                       ,engine='python'         #use the Python parsing engine
                       ,quotechar='^'           #unlikely quote character, so double quotes in the text are read literally
                       ,comment='Comment:'      #annotator comments are for human readers; we don't need them
                       ,names=colnames          #preset columns we need
                      )
    print("Number of rows: ", len(df.index))

    return df

In [3]:
#remove double quotes from every cell
def clean_doublequotes(df):
    return df.replace('"', '', regex=True)

In [4]:
#BEATRICE: replace every relation other than Cause-Effect with "Other", so these samples are kept as negatives instead of being dropped
def clean_class(df):
    
    return df.replace(regex=['Product-Producer','Entity-Origin','Instrument-Agency','Component-Whole','Content-Container','Entity-Destination','Member-Collection','Message-Topic'], value='Other')
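
A minimal sketch of what this replacement does, assuming label strings in the raw `Relation(e1,e2)` form. The sample values and the shortened relation list are illustrative only. Because the replacement is regex-based, only the matching substring is rewritten, so the direction suffix survives until `clean_tags` removes it later.

```python
import pandas as pd

# Hypothetical label strings in the raw "Relation(e1,e2)" form
labels = pd.DataFrame({'ID': ['Cause-Effect(e2,e1)', 'Product-Producer(e1,e2)', 'Other']})

# Shortened relation list for illustration; the notebook passes all eight non-causal relations
collapsed = labels.replace(regex=['Product-Producer', 'Message-Topic'], value='Other')

# Only the matched substring is replaced, so the "(e1,e2)" suffix is still present
print(collapsed.ID.tolist())
```

Since the replacement runs over the whole dataframe, a relation name that happened to appear inside a sentence would be rewritten as well.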

In [5]:
#strip entity tags such as <e1> ('angle') or direction suffixes such as (e2,e1) ('round') from a column
def clean_tags(df_column,braket_type):
    
    if braket_type == 'angle':
        pattern = r'<.*?>'
    elif braket_type == 'round':
        pattern = r'(\(.*?\))'
    else:
        raise ValueError("braket_type must be 'angle' or 'round'")
    
    return df_column.str.replace(pattern, '', regex=True)
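
The two patterns can be tried on a single string with the standard `re` module. This is a sketch only; the sample sentence and label are made up to resemble the SemEval markup.

```python
import re

# Same patterns as in clean_tags
ANGLE = r'<.*?>'        # non-greedy: matches one tag at a time, e.g. <e1> or </e1>
ROUND = r'(\(.*?\))'    # matches a parenthesised group such as (e2,e1)

# Hypothetical sentence and label in SemEval-style markup
sentence = "The <e1>fire</e1> was caused by a <e2>spark</e2>."
label = "Cause-Effect(e2,e1)"

clean_sentence = re.sub(ANGLE, '', sentence)
clean_label = re.sub(ROUND, '', label)
print(clean_sentence)   # The fire was caused by a spark.
print(clean_label)      # Cause-Effect

# A greedy pattern would remove everything from the first '<' to the last '>'
greedy = re.sub(r'<.*>', '', sentence)
print(greedy)           # The .
```

The non-greedy `.*?` is what keeps the words between the tags intact.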

In [6]:
#create tuple of keys
keys = ('Other','Cause-Effect')
#keys = ('Other','Cause-Effect','Product-Producer','Entity-Origin','Instrument-Agency','Component-Whole','Content-Container','Entity-Destination','Member-Collection','Message-Topic')
#for i in range(len(keys)): print(i, keys[i])

In [7]:
def prepare_file(file_name,use):
    
    #Import training file into a dataframe
    df = import_text(file_name)

    #clean ""
    df = clean_doublequotes(df)

    #BEATRICE make binary
    df = clean_class(df)

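    #pair each sentence row with the label stored in the first column of the next row
    df['Clasification'] = df['ID'].shift(-1)

    #drop the label-only rows (they have no Text); copy so the column assignments below do not trigger SettingWithCopyWarning
    df = df[df.Text.notna()].copy()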

    #clean tags e.g. <e2> - better option would be to keep them and use for more precise prediction
    df.Text = clean_tags(df.Text,braket_type='angle')

    #clean tags e.g. (e2,e1) - better option would be to keep them and use for more precise prediction
    df.Clasification = clean_tags(df.Clasification,braket_type='round')

    #Map Keys to numbers
    df['Clasification_ID'] = df.Clasification.map(lambda x: keys.index(x))

    #store in the global variables so they can be used later in this file
    if use == 'train':
        global train_data
        train_data = df.Text
        
        global train_target
        train_target = df.Clasification_ID
    elif use == 'test':
        global test_data
        test_data = df.Text
        
        global test_target
        test_target = df.Clasification_ID

    #sanity check: one output row per sentence
    print('Output rows:', len(df.index))
  
    return df
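
The row-pairing step in `prepare_file` can be sketched on a tiny hand-built dataframe. The rows below are assumptions that mimic the parsed file: a sentence row with `ID` and `Text`, followed by a label row where the label lands in `ID` and `Text` is missing.

```python
import pandas as pd

# Hypothetical rows mimicking the parsed input file
raw = pd.DataFrame({
    'ID':   ['1', 'Cause-Effect', '2', 'Other'],
    'Text': ['The fire was caused by a spark.', None,
             'The company fabricates plastic chairs.', None],
})

raw['Clasification'] = raw['ID'].shift(-1)   # pull the next row's first column up one row
paired = raw[raw.Text.notna()].copy()        # keep only the sentence rows

print(paired[['ID', 'Text', 'Clasification']])
```

Only the even-numbered index values survive the filter, which is why the dataframes displayed in this notebook are indexed 0, 2, 4, and so on.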

#### Prepare Training file

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
%cd /content/drive/MyDrive/Github/Group-9C-Capstone

/content/drive/MyDrive/Github/Group-9C-Capstone


In [10]:
#prepare file
df_train = prepare_file('semeval2010task8_train.txt','train')

df_train.head(5)

Number of rows:  16000
Output rows: 8000


| | ID | Text | Clasification | Clasification_ID |
|---|---|---|---|---|
| 0 | 1 | The system as described above has its greatest... | Other | 0 |
| 2 | 2 | The child was carefully wrapped and bound into... | Other | 0 |
| 4 | 3 | The author of a keygen uses a disassembler to ... | Other | 0 |
| 6 | 4 | A misty ridge uprises from the surge. | Other | 0 |
| 8 | 5 | The student association is the voice of the un... | Other | 0 |


In [11]:
df_test = prepare_file('semeval2010task8_test.txt','test')

df_test.head(5)

Number of rows:  5434
Output rows: 2717


| | ID | Text | Clasification | Clasification_ID |
|---|---|---|---|---|
| 0 | 8001 | The most common audits were about waste and re... | Other | 0 |
| 2 | 8002 | The company fabricates plastic chairs. | Other | 0 |
| 4 | 8003 | The school master teaches the lesson with a st... | Other | 0 |
| 6 | 8004 | The suspect dumped the dead body into a local ... | Other | 0 |
| 8 | 8005 | Avian influenza is an infectious disease of bi... | Cause-Effect | 1 |


## Step 4: Training Models


In [12]:
# use the TF-IDF vectorizer and create a pipeline that attaches it to a multinomial naive Bayes classifier
from sklearn.feature_extraction.text import TfidfVectorizer    #Convert a collection of raw documents to a matrix of TF-IDF features
from sklearn.naive_bayes import MultinomialNB                  #multinomial Naive Bayes classifier is suitable for classification with discrete features
from sklearn.pipeline import make_pipeline                     #Construct a Pipeline from the given estimators

#show data
import seaborn as sns; sns.set()                               #data visualization library based on matplotlib
import matplotlib.pyplot as plt                                #interactive plots 
from sklearn.metrics import confusion_matrix                   #Compute confusion matrix to evaluate the accuracy of a classification

### Train model 5: Support Vector Machine Linear Kernel with Gridsearch (2/3-gram)

#### GridSearchCV

Grid search tries every combination of the candidate hyperparameter values, scores each combination with cross-validation, and keeps the one with the best score.

[Rohan Joseph, 2018](https://towardsdatascience.com/grid-search-for-model-tuning-3319b259367e)
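
As a rough illustration of the idea, the sketch below runs scikit-learn's `GridSearchCV` on a made-up toy corpus. The sentences, labels, `cv=3`, and the candidate grid are all assumptions chosen so the example runs quickly; they are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = causal wording, 0 = anything else
causal = ["the fire was caused by a spark", "smoking causes cancer",
          "the flood resulted from heavy rain", "stress leads to fatigue",
          "the crash was triggered by ice", "the virus causes fever"]
other = ["the company fabricates plastic chairs", "the cat sat on the mat",
         "she put the keys in the drawer", "the author wrote a book",
         "the team is part of the league", "the box contains old letters"]
texts = causal + other
labels = [1] * len(causal) + [0] * len(other)

toy_pipeline = Pipeline([('vect', TfidfVectorizer()), ('clf', LinearSVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter name>
toy_grid = {'vect__ngram_range': [(1, 1), (1, 2)]}

# Every candidate in the grid is fitted and scored on each of the 3 folds
search = GridSearchCV(toy_pipeline, toy_grid, cv=3)
search.fit(texts, labels)

print(search.best_params_)                  # the best-scoring candidate
print(len(search.cv_results_['params']))    # number of candidates tried
```

With a corpus this small the scores are not meaningful; the point is only the mechanics of enumerating candidates and picking the best one.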

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
#Reference: https://gist.github.com/dspp779/5a9597e2d8a2518b80fb0ad191ea8463

In [23]:
#Build a vectorizer / classifier pipeline (with default settings TfidfVectorizer keeps all tokens; set min_df / max_df to filter out tokens that are too rare or too frequent)
pipeline = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LinearSVC()),
])

In [24]:
#Build a grid search to compare two n-gram ranges: unigrams + bigrams (1, 2) versus unigrams through trigrams (1, 3).
#Fit the pipeline on the training set, using grid search to choose the parameters
parameters = {
    'vect__ngram_range': [(1, 2), (1, 3)],
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1)
grid_search.fit(train_data, train_target)

GridSearchCV(estimator=Pipeline(steps=[('vect', TfidfVectorizer()),
                                       ('clf', LinearSVC())]),
             n_jobs=-1, param_grid={'vect__ngram_range': [(1, 2), (1, 3)]})
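
To make the two candidate ranges concrete, here is a small pure-Python sketch that lists the word n-grams of a sentence. It uses plain whitespace tokenization, which is a simplification and not identical to `TfidfVectorizer`'s own tokenizer; `word_ngrams` is a hypothetical helper written for this illustration.

```python
def word_ngrams(text, ngram_range):
    """Return all word n-grams of `text` for n from min_n to max_n, lowercased."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

sentence = "The company fabricates plastic chairs"

# (1, 2): 5 unigrams + 4 bigrams
print(word_ngrams(sentence, (1, 2)))
# (1, 3): the same features plus 3 trigrams
print(len(word_ngrams(sentence, (1, 3))))
```

Both ranges include unigrams, so the grid compares "up to bigrams" against "up to trigrams" rather than bigrams against trigrams in isolation.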

In [25]:
#Print the mean and std for each candidate along with the parameter settings for all the candidates explored by grid search.
n_candidates = len(grid_search.cv_results_['params'])
for i in range(n_candidates):
    print(i, 'params - %s; mean - %0.2f; std - %0.2f'
             % (grid_search.cv_results_['params'][i],
                grid_search.cv_results_['mean_test_score'][i],
                grid_search.cv_results_['std_test_score'][i]))

0 params - {'vect__ngram_range': (1, 2)}; mean - 0.95; std - 0.03
1 params - {'vect__ngram_range': (1, 3)}; mean - 0.95; std - 0.03


In [26]:
#Predict the outcome on the testing set and store it in a variable named y_predicted
y_predicted = grid_search.predict(test_data)

#### Accuracy Rate

In [27]:
accuracy_for_test_keys = np.mean(y_predicted == test_target)
print("SVM Model Accuracy = {} %".format(accuracy_for_test_keys*100))

SVM Model Accuracy = 96.72432830327567 %


#### F1-Score

In [28]:
from sklearn import metrics
print(metrics.classification_report(test_target, y_predicted, target_names=keys))
#Precision is the fraction of positive-class predictions that actually belong to the positive class.
#Recall is the fraction of actual positive examples that the model predicts as positive.
#F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean.
#Higher F1 scores are better; 1.0 is a perfect score.

              precision    recall  f1-score   support

       Other       0.97      0.99      0.98      2389
Cause-Effect       0.94      0.78      0.85       328

    accuracy                           0.97      2717
   macro avg       0.96      0.89      0.92      2717
weighted avg       0.97      0.97      0.97      2717
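
The three metrics can be computed by hand from true and predicted labels. The sketch below uses made-up labels for the function demo, then plugs in the Cause-Effect precision and recall and the support counts from the report above; `precision_recall_f1` is a hypothetical helper, not part of scikit-learn.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 2 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, f1)

# Harmonic mean of the reported Cause-Effect precision (0.94) and recall (0.78)
reported_f1 = 2 * 0.94 * 0.78 / (0.94 + 0.78)
print(round(reported_f1, 2))

# Accuracy of always predicting "Other", from the support counts 2389 and 328
majority_baseline = 2389 / (2389 + 328)
print(round(majority_baseline, 3))
```

The baseline is a useful yardstick here: because most test sentences are "Other", accuracy alone flatters the model, and the Cause-Effect recall shows more clearly how many causal sentences are missed.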



### Pull Out Predicted Cause-Effect Statements

In [30]:
df_test['Predicted'] = y_predicted
#BEATRICE Return Predicted Cause-Effect Statements 
df_ceff=df_test[(df_test.Predicted.eq(1))]
df_ceff.head(10)


| | ID | Text | Clasification | Clasification_ID | Predicted |
|---|---|---|---|---|---|
| 8 | 8005 | Avian influenza is an infectious disease of bi... | Cause-Effect | 1 | 1 |
| 60 | 8031 | Of the hundreds of strains of avian influenza ... | Cause-Effect | 1 | 1 |
| 66 | 8034 | Essentially, the blisters that appear in the m... | Other | 0 | 1 |
| 78 | 8040 | Roundworms or ascarids are caused by an intest... | Other | 0 | 1 |
| 114 | 8058 | Traffic vibrations on the street outside had c... | Cause-Effect | 1 | 1 |
| 144 | 8073 | The slide, which was triggered by an avalanche... | Cause-Effect | 1 | 1 |
| 164 | 8083 | Muscle fatigue is the number one cause of arm ... | Cause-Effect | 1 | 1 |
| 208 | 8105 | It spilled more than 53000 gallons of crude oi... | Cause-Effect | 1 | 1 |
| 212 | 8107 | The diseases are caused by gene mutations on t... | Cause-Effect | 1 | 1 |
| 214 | 8108 | The pretexts offered were laughable, and the r... | Cause-Effect | 1 | 1 |
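
The predicted positives can be split further into correct and incorrect predictions. The sketch below uses a small stand-in dataframe whose IDs and labels are copied from rows displayed in this notebook; `toy` and the derived names are illustrative, not variables the notebook defines.

```python
import pandas as pd

# Miniature stand-in for df_test after the 'Predicted' column has been added
toy = pd.DataFrame({
    'ID': [8002, 8005, 8034, 8040, 8058],
    'Clasification_ID': [0, 1, 0, 0, 1],
    'Predicted': [0, 1, 1, 1, 1],
})

predicted_pos = toy[toy.Predicted.eq(1)]
true_pos = predicted_pos[predicted_pos.Clasification_ID.eq(1)]    # correctly flagged as Cause-Effect
false_pos = predicted_pos[predicted_pos.Clasification_ID.eq(0)]   # flagged, but labelled Other

print(true_pos.ID.tolist())
print(false_pos.ID.tolist())
```

Reading the false positives is a quick way to see which wording the model over-trusts; row 8040, for example, contains "caused by" but is labelled "Other".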
