#**`NB3_StratifiedSample_TF_IDF : Performing stratified sample analysis to make an accurate comparison between TF-IDF and Hugging`**
* **`This will ensure that the size of the data set utilised does not affect the metrics`**

##**`Load the dataset`**

In [1]:
#mount gdrive
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
#Reading the data from google drive
data=pd.read_csv('gdrive/MyDrive/SEA820_DataSets/AI_Human.csv')

In [5]:
df = data.copy()

In [12]:
df = df.rename(columns={'generated': 'label'})
df['label'] = df['label'].astype('int64')

##**`Obtaining a sample of 5000 entries while maintaining the AI vs Human ratio`**

In [13]:
df_sample = (
    df.groupby('label', group_keys=False)
      .apply(lambda g: g.sample(frac=5000/len(df), random_state=42))
      .reset_index(drop=True)
)


In [14]:
df_sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    5000 non-null   object
 1   label   5000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 78.3+ KB


In [15]:
#Using a smaller sample for faster model training while maintaining the ratio of AI vs human entries
df_sample['label'].value_counts()

label,count
0,3138
1,1862
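
In the counts above, 3138 of the 5000 sampled rows (62.76%) carry label 0 and 1862 (37.24%) carry label 1. A minimal sketch of the same idea on toy data follows, assuming pandas 1.1 or later so that `GroupBy.sample` is available. The `toy` frame, its 60/40 split, and `target_size` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Toy frame standing in for the real data: 60% label 0, 40% label 1.
toy = pd.DataFrame({
    "text": [f"doc {n}" for n in range(1000)],
    "label": [0] * 600 + [1] * 400,
})

target_size = 100  # hypothetical sample size for this illustration
frac = target_size / len(toy)

# GroupBy.sample draws the same fraction from every label group,
# so the class ratio of the full frame carries over to the sample.
toy_sample = (
    toy.groupby("label")
       .sample(frac=frac, random_state=42)
       .reset_index(drop=True)
)

print(toy_sample["label"].value_counts(normalize=True))
```

Because every group is sampled with the same fraction, the sample keeps the 60/40 ratio of the toy frame. Note that `GroupBy.sample` may select different rows than the `apply` based version even with the same `random_state`.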


##**`Preprocessing`**

In [16]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import sent_tokenize
import string
from nltk.tokenize import word_tokenize
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [17]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

In [23]:
#Additional Imports
import nltk
from nltk.corpus import stopwords
from collections import  Counter

In [25]:
#acquire all stop words in english
nltk.download('stopwords')
stop=set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [26]:
i=0
def preprocess_text(text, method='lemma'):
  #tokenize into words and convert to lower case
  tokens = [w.lower() for w in word_tokenize(text)]
  #remove punctuation and stop words (uses the NLTK English stop word set `stop` built in the previous cell)
  tokens = [tok for tok in tokens if tok not in stop and tok not in string.punctuation]
  #keep only purely alphabetic tokens (drops numbers, contraction fragments and mixed tokens)
  tokens = [tok for tok in tokens if re.fullmatch(r'[A-Za-z]+', tok)]
  if method == 'lemma':
    #apply WordNet lemmatization, treating every token as a noun (pos='n')
    lemmatizer = WordNetLemmatizer()
    processed= [lemmatizer.lemmatize(w, pos='n') for w in tokens]
  else:
    #apply Porter stemming to the tokens
    stemmer = PorterStemmer()
    processed= [stemmer.stem(w) for w in tokens]
  single_str = ' '.join(processed) #return string joined by spaces
  #progress counter shared across calls
  global i
  i += 1
  print(f"Processing row {i}")
  return single_str
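
The alphabetic filter in `preprocess_text` relies on `re.fullmatch`, which requires the whole token to match the pattern. A small standard-library sketch of that one step is shown below. The token list is hypothetical, chosen to resemble what a tokenizer might emit for messy text.

```python
import re

# Hypothetical tokens, chosen to resemble tokenizer output for messy text.
tokens = ["essay", "n't", "covid19", "e-mail", "2024", "venus", "..."]

# Same pattern as in preprocess_text: the WHOLE token must be ASCII letters.
alpha_only = [tok for tok in tokens if re.fullmatch(r"[A-Za-z]+", tok)]

print(alpha_only)  # ['essay', 'venus']
```

Tokens containing digits, hyphens or apostrophes are dropped entirely rather than trimmed, which is worth keeping in mind when reading the processed text.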

In [27]:
copy_df= df_sample.copy()
copy_df.head()

Unnamed: 0,text,label
0,Do curfews keep teenagers from Getting into tr...,0
1,"In this article ""The Challenge of Exploring Ve...",0
2,With THP rapid growth of THP Internet in recen...,0
3,The electoral College is the way Us United Sta...,0
4,This technology of you can calculate the emoti...,0


In [28]:
copy_df['processed_lemmatized'] = copy_df['text'].apply(preprocess_text, method="lemma")

Processing row 1
Processing row 2
Processing row 3
...

In [29]:
copy_df.head()

Unnamed: 0,text,label,processed_lemmatized
0,Do curfews keep teenagers from Getting into tr...,0,curfew keep teenager getting trouble city coun...
1,"In this article ""The Challenge of Exploring Ve...",0,article challenge exploring venus author sugge...
2,With THP rapid growth of THP Internet in recen...,0,thp rapid growth thp internet recent decade th...
3,The electoral College is the way Us United Sta...,0,electoral college way u united state citizen v...
4,This technology of you can calculate the emoti...,0,technology calculate emotion people pretty muc...


##**`TF-IDF on the preprocessed data`**

In [30]:
from sklearn.model_selection import train_test_split

#Use the processed_lemmatized column as the input feature text and the 'label' column as the target
features = copy_df['processed_lemmatized']
target = copy_df['label']

In [31]:
#Split the data into training and testing sets with an 80/20 split (test_size=0.2); random_state is set for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
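
The split above is random, so the AI vs human ratio in the train and test sets can drift slightly from the ratio in the sample. One possible variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below uses hypothetical toy data (`texts`, `labels`) with a 60/40 class split to show the effect.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 60 documents of class 0 and 40 of class 1.
texts = [f"doc {n}" for n in range(100)]
labels = [0] * 60 + [1] * 40

# stratify=labels asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

print(Counter(y_tr), Counter(y_te))
```

With stratification, both splits keep the 60/40 proportion of the toy labels. Applying it to the notebook's data would change which rows land in the test set, so the metrics reported further down would not be directly reproduced.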

In [32]:
print("X_train:\n", X_train)
print("\nX_test:\n", X_test)
print("\ny_train:\n", y_train)
print("\ny_test:\n", y_test)

X_train:
 4227    nowadays korea think many young people give en...
4676    facial action coding system facs computer syst...
800     luke merger member oc seagoing cowboy program ...
3671    hey average grader bear essay school board pla...
4193    believe imagination important knowledge imagin...
                              ...                        
4426    title face mar mysterious sight captured imagi...
466     summer project notorious extremely uninteresti...
3092    dhe limitation human contacted due use technol...
3772    failure inevitable part life seen good bad mai...
860     yes identify churchill statement u reach want ...
Name: processed_lemmatized, Length: 4000, dtype: object

X_test:
 1501    face face mar natural landform although people...
2586    limiting car usage would bring ridiculous amou...
2653    lot gook worker going best fink new job sank b...
1055    use facial action bonding system would valuabl...
705     may concern student attend class home way onli

In [33]:
#Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [34]:
#Initialize a TfidfVectorizer and fit it on the training data (X_train) only.
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(X_train)

In [35]:
#Reuse the matrix from fit_transform for the training data and transform the testing data (X_test) with the same fitted vectorizer.
X_train_vec = tfidf_matrix
X_test_vec = tfidf_vec.transform(X_test)
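
Fitting on the training data only means the vocabulary is fixed before the test data is seen. The sketch below illustrates this on two hypothetical training documents and one hypothetical test document. All names in it (`train_docs`, `test_docs`, `vec`) are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; "spaceship" never appears in the training text.
train_docs = ["car usage limit city", "electoral college vote state"]
test_docs = ["car vote spaceship"]

vec = TfidfVectorizer()
train_vec = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_vec = vec.transform(test_docs)        # reuses them, learns nothing new

print(train_vec.shape, test_vec.shape)
print("spaceship" in vec.vocabulary_)
```

Both matrices have the same number of columns, and the unseen word is ignored when the test document is transformed. This keeps information from the test set out of the learned features.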

In [36]:
#Choose a simple classifier, such as Naive Bayes
from sklearn.naive_bayes import MultinomialNB

In [37]:
#Train the classifier using the vectorized training data and corresponding training labels (y_train).
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
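
As an alternative way to organize the same two steps, the vectorizer and the classifier can be chained with scikit-learn's `make_pipeline`. This is only a sketch on hypothetical toy documents, not the approach taken in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus with two clearly separated topics.
train_docs = [
    "cat dog pet",
    "dog pet animal",
    "stock market price",
    "market price trade",
]
train_labels = [0, 0, 1, 1]

# The pipeline fits the vectorizer and the classifier in one call and
# applies the fitted vectorizer automatically at prediction time.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

toy_predictions = model.predict(["pet dog", "market trade"])
print(toy_predictions)
```

A pipeline accepts raw strings at prediction time, which removes the risk of accidentally refitting the vectorizer on test data.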

In [38]:
#Make predictions on the vectorized test data (X_test_vec).
predictions = clf.predict(X_test_vec)
print(predictions)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0
 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 1 0 1
 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0
 1 0 1 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 1 0 0
 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1
 1 0 1 1 0 0 0 0 0 0 0 0 

In [39]:
#Calculate and print the accuracy, precision, recall and F1 score of the classifier using sklearn.metrics.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

Accuracy: 0.87
Precision: 1.00
Recall: 0.65
F1 Score: 0.79
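
Since label 1 marks the generated (AI) class, a precision of 1.00 together with a recall of 0.65 suggests that texts flagged as AI were almost always AI, while roughly a third of the AI texts were missed. The F1 score is the harmonic mean of the two, which can be checked from the printed values. Because those values are already rounded, the result is approximate.

```python
# Rounded values as printed above.
precision = 1.00
recall = 0.65

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 from rounded precision and recall: {f1:.2f}")  # 0.79
```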
