
# XGBoost Model

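This notebook trains an XGBoost classifier on n-gram counts to tell imagined stories from recalled ones, following this article: https://suatatan.com/posts/sklearn_xgboost_tc/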

In [1]:
# Import relevant packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import CountVectorizer
from xgboost import XGBClassifier

## The data

In [2]:
# import csv file as dataframe (from GitHub repo)
url = 'https://raw.githubusercontent.com/feliciahf/data_science_exam/main/hippoCorpusV2.csv'
df = pd.read_csv(url, encoding='latin1', delimiter=",")
# remove uninformative columns
uninformative_cols = ["AssignmentId", "WorkerId", "recAgnPairId", "recImgPairId"]
df = df.drop(columns=uninformative_cols)
# keep only imagined and recalled stories (drop retold ones);
# copy so that later column assignments do not raise SettingWithCopyWarning
df = df[df.memType != 'retold'].copy()
df

Unnamed: 0,WorkTimeInSeconds,annotatorAge,annotatorGender,annotatorRace,distracted,draining,frequency,importance,logTimeSinceEvent,mainEvent,memType,mostSurprising,openness,similarity,similarityReason,story,stressful,summary,timeSinceEvent
0,1641,25.0,man,white,1.0,1.0,,3.0,4.499810,attending a show,imagined,when I got concert tickets,0.000,3.0,"I've been to a couple concerts, but not many.","Concerts are my most favorite thing, and my bo...",1.0,My boyfriend and I went to a concert together ...,90.0
1,1245,25.0,woman,white,1.0,1.0,3.0,4.0,4.499810,a concert.,recalled,we saw the beautiful sky.,1.000,,,"The day started perfectly, with a great drive ...",1.0,My boyfriend and I went to a concert together ...,90.0
2,1159,35.0,woman,black,1.0,1.0,,4.0,5.010635,my sister having her twins a little early,imagined,she went into labor early,0.500,3.0,I am a mother myself,It seems just like yesterday but today makes f...,1.0,My sister gave birth to my twin niece and neph...,150.0
3,500,30.0,woman,white,1.0,4.0,3.0,5.0,5.010635,meeting my twin niece and nephew.,recalled,finding out they were healthy.,1.000,,,"Five months ago, my niece and nephew were born...",2.0,My sister gave birth to my twin niece and neph...,150.0
4,1074,25.0,man,white,2.0,2.0,,3.0,3.401197,the consequences of going to burning man,imagined,When I don't answer the phone in case I owe th...,0.250,4.0,Because I also have money problems,About a month ago I went to burning man. I was...,4.0,It is always a journey for me to go to burning...,30.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6849,926,30.0,woman,other,3.0,5.0,3.0,5.0,5.010635,losing and finding a pet.,recalled,the kitten ran into my arms.,0.125,,,My dog was diagnosed with lymphoma a year ago ...,5.0,"My dog, who had lymphoma, was suffering so I h...",150.0
6850,3044,18.0,woman,asian,4.0,2.0,4.0,2.0,6.345636,about a vacation event worked on,recalled,when i encountered an guy who was really scared,-0.500,,,"Over my vacation from my job, I went to Casper...",5.0,"On vacation, a side job was taken to plan an e...",570.0
6851,1008,35.0,man,asian,1.0,2.0,2.0,4.0,3.044522,my nephew's birthday party,recalled,a lot of people got in the pool.,0.500,,,This event was a birthday party for my nephew....,2.0,This was a birthday party for my nephew that h...,21.0
6852,1462,30.0,man,hisp,1.0,1.0,3.0,3.0,2.639057,my cousin's birthday,recalled,my cousin threw a tantrum in the middle of the...,0.500,,,This event occurred about two weeks ago. I was...,2.0,It was my little cousin's birthday and went to...,14.0


In [3]:

# make labels column using numerical values
df['memType'] = pd.Categorical(df['memType'])
df['label'] = df['memType'].cat.codes

# story type corresponding to label
print(f"Label 0: {df.loc[df['label'] == 0,'memType'].unique()}")
print(f"Label 1: {df.loc[df['label'] == 1,'memType'].unique()}")

Label 0: ['imagined']
Categories (1, object): ['imagined']
Label 1: ['recalled']
Categories (1, object): ['recalled']
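
Why `imagined` ends up as label 0: when the categories are not given explicitly, `pd.Categorical` sorts the unique values, and `cat.codes` returns each value's position in that sorted list. A minimal plain-Python sketch of the same mapping (the `mem_types` list is made-up example data, not rows from the corpus):

```python
# Made-up example values standing in for the memType column
mem_types = ["recalled", "imagined", "imagined", "recalled"]

# Inferred categories are the sorted unique values
categories = sorted(set(mem_types))

# Each code is the index of the value within the sorted categories
codes = [categories.index(m) for m in mem_types]

print(categories)
print(codes)
```

If a fixed mapping is wanted regardless of which values are present, the categories can be passed explicitly, e.g. `pd.Categorical(values, categories=[...])`.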


In [4]:
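# bag-of-words features: the 5000 most frequent word 1- to 3-grams
cv = CountVectorizer(max_features=5000, encoding="utf-8",
      ngram_range=(1, 3),
      token_pattern=r"[A-Za-z_][A-Za-z\d_]*")  # raw string so \d is not treated as a string escape
X = cv.fit_transform(df.story).toarray()
y = df['label']
# hold out 33% of the stories for testing
X_train, X_test, y_train, y_test = train_test_split(X, y,
      test_size=0.33,
      random_state=0)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
count_df = pd.DataFrame(X_train, columns=cv.get_feature_names_out())
# y_train keeps df's original (shuffled) index, so assign raw values to avoid misaligned labels
count_df['label'] = y_train.to_numpy()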

df.sample(5)

Unnamed: 0,WorkTimeInSeconds,annotatorAge,annotatorGender,annotatorRace,distracted,draining,frequency,importance,logTimeSinceEvent,mainEvent,memType,mostSurprising,openness,similarity,similarityReason,story,stressful,summary,timeSinceEvent,label
6809,1446,55.0,man,white,1.0,3.0,3.0,4.0,3.73767,Getting together with my family on Labor Day w...,recalled,I sensed that my teenage niece is a little afr...,1.0,,,I saw my 15 year-old niece with my brother and...,3.0,"I got together with my mother, brother, his wi...",42.0,1
1335,1054,25.0,man,white,1.0,1.0,,5.0,4.787492,dating my best friend.,imagined,the best friend had feelings.,0.375,4.0,I've had similar experiences. Mine didn't turn...,I always wanted to be with someone. I saw peop...,1.0,I realized i feelings for by best friend. I to...,120.0,0
2859,1264,30.0,woman,white,2.0,1.0,4.0,5.0,4.787492,Moving away from my comfort zone.,recalled,I ended up enjoying the move. I thought I woul...,-0.125,,,"I have lived in one state my entire life, neve...",3.0,My husband took a position within his company ...,120.0,1
477,2689,30.0,man,na,1.0,1.0,,3.0,4.094345,Vegas tour and nightlife,imagined,Drinking in the streets,1.0,4.0,I like nightlife,"Two months ago, as we got to the Vegas strip, ...",1.0,The experience we had on this tour was somethi...,60.0,0
1712,1254,35.0,woman,white,1.0,1.0,,3.0,4.143135,my son opening up a new restaurant.,imagined,I drove up to the business and it was packed w...,1.0,1.0,I didn't identify with this story because I do...,"Nine weeks ago, I got to witness first hand th...",1.0,My son followed his dream and opened up his ow...,63.0,0
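
To make the feature extraction step more concrete, here is a rough plain-Python sketch of what the vectorizer counts for a single document with the same token pattern and `ngram_range=(1, 3)`. The function name `ngram_counts` and the example sentence are made up for illustration; the sketch ignores vocabulary building and the `max_features` cut-off, so it is not a replacement for `CountVectorizer`.

```python
import re
from collections import Counter

# Same token pattern as in the CountVectorizer call: tokens must start
# with a letter or underscore, so a bare number like "2" is dropped
TOKEN_PATTERN = re.compile(r"[A-Za-z_][A-Za-z\d_]*")

def ngram_counts(text, ngram_range=(1, 3)):
    """Count word n-grams in one document (hypothetical helper)."""
    # CountVectorizer lowercases text by default before tokenizing
    tokens = TOKEN_PATTERN.findall(text.lower())
    min_n, max_n = ngram_range
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # slide a window of n tokens over the document
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = ngram_counts("My dog ran. My dog slept 2 hours.")
print(counts["my dog"], counts["my dog slept"])
```

Note that n-grams are built over the whole token sequence, so they can span sentence boundaries (here, `ran my` is counted as a bigram).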


In [8]:
# fit model to training data
model = XGBClassifier()
model.fit(X_train, y_train)

# how well model does on training data
# (predict() already returns class labels, so no rounding is needed)
train_pred = model.predict(X_train)
acc_train = accuracy_score(y_train, train_pred)
print("Accuracy on train data: %.2f%%" % (acc_train * 100.0))

# make predictions on test data
predictions = model.predict(X_test)

# evaluate predictions on test data
accuracy = accuracy_score(y_test, predictions)
print("Accuracy on test data: %.2f%%" % (accuracy * 100.0))

Accuracy on train data: 81.85%
Accuracy on test data: 69.18%


In [12]:
# per-class precision, recall and F1 on the test data
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.67      0.75      0.71       915
           1       0.71      0.64      0.67       912

    accuracy                           0.69      1827
   macro avg       0.69      0.69      0.69      1827
weighted avg       0.69      0.69      0.69      1827
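
For reading the report: each row treats one class as the "positive" class. Precision is the share of predictions for that class that were correct, recall is the share of true members of that class that were found, and F1 is their harmonic mean. A minimal sketch of those definitions in plain Python, where `binary_report` is a hypothetical helper and the label lists are toy values, not the model's predictions:

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy labels for illustration only
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 0, 1, 1]

for cls in (0, 1):
    p, r, f = binary_report(y_true_toy, y_pred_toy, positive=cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The `support` column in the report is simply the number of true test examples in each class.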

