# **Random Document Generator Using a GPT Transformer Model**

In [80]:
from transformers import pipeline

In [81]:
generator = pipeline('text-generation', model="openai-community/gpt2")

# **Document Generation**

In [82]:
document_1 = generator('the history of the middle east', max_length=600, num_return_sequences=1, truncation=True)
document_1 = document_1[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [83]:
document_2 = generator('artificial intelligence revolution', max_length=600, num_return_sequences=1, truncation=True)
document_2 = document_2[0]['generated_text']

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [84]:
print(document_1)
print(document_2)

the history of the middle east, which began with the Islamic invasion of Iraq. This is a historical problem that still needs to be solved or there will be a change of course."

Turkey's parliament, meanwhile, has agreed to convene a committee to report the findings of the report into corruption and other corruption issues during the Erdogan presidency. An investigative panel headed by Justice Minister Numan Kurtulmus said that the corruption has resulted in up to $100 million (£57 million) in money laundering, a charge it charged is against the prime minister.

The court case has sparked widespread speculation. For instance, a prominent figure in the city of Gaziantep, Mevlut Cavusoglu, accused a newspaper of inciting anger over the coup in July, and the Justice Minister charged a former editor, Efrem Ozbilic, with leaking information to the media.

The Justice Minister's investigation into the coup plot, and Cavusoglu's decision to leave the country last year, was seen by some critics

# **Preprocessing of Data**

In [85]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [86]:
def clean_data(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    #Normalizing text to lowercase
    text = text.lower()

    #Remove symbols, characters, and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    #Tokenization
    tokens = word_tokenize(text)

    #Lemmatization and stop words cleaning
    cleaned_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

    #Full corpus
    cleaned_text = ' '.join(cleaned_tokens)

    return cleaned_text

In [87]:
def get_unique_words(text):
    words = word_tokenize(text)
    unique_words = set(words)
    return unique_words

In [88]:
document_1_cleaned = clean_data(document_1)
document_2_cleaned = clean_data(document_2)

In [89]:
print(document_1_cleaned)
print(document_2_cleaned)

history middle east began islamic invasion iraq historical problem still need solved change course turkey parliament meanwhile agreed convene committee report finding report corruption corruption issue erdogan presidency investigative panel headed justice minister numan kurtulmus said corruption resulted million million money laundering charge charged prime minister court case sparked widespread speculation instance prominent figure city gaziantep mevlut cavusoglu accused newspaper inciting anger coup july justice minister charged former editor efrem ozbilic leaking information medium justice minister investigation coup plot cavusoglus decision leave country last year seen critic attempt protect turkish citizen prosecution ankara opposition leader called public support investigation judicial investigation thorough process turkish prime minister binali yildirim said monday threehour meeting country foreign minister
artificial intelligence revolution way find answer begun deep learning r
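
The cleaned output contains fused tokens such as `cavusoglus` and `threehour`. The sketch below isolates the regex step from `clean_data` to show where they come from. The sample sentence is made up for illustration, and the space-replacing variant is an assumption about one possible alternative, not the notebook's method.

```python
import re

# Hypothetical sample text, not taken from the generated documents.
sample = "Cavusoglu's three-hour meeting cost $100 million."

# Same pattern as in clean_data: delete everything except letters and whitespace.
# Apostrophes and hyphens vanish, so the neighbouring letters fuse together.
stripped = re.sub(r'[^a-zA-Z\s]', '', sample.lower())
print(stripped.split())

# Variant (assumption): substitute a space instead of deleting, so hyphenated
# words split apart. This leaves a stray "s" token that still needs filtering.
spaced = re.sub(r'[^a-zA-Z\s]', ' ', sample.lower())
print(spaced.split())
```

Which behaviour is preferable depends on whether possessives and hyphenated compounds should count as separate vocabulary entries.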

In [90]:
full_corpus = [document_1_cleaned, document_2_cleaned]

# **Extracting Unique Words from the Corpus**

In [91]:
unique_words = set()
for doc in full_corpus:
    unique_words = unique_words.union(set(doc.split()))
print(unique_words)
print(len(unique_words))

{'anger', 'protect', 'want', 'looking', 'collaborative', 'iraq', 'stage', 'resulted', 'benefit', 'valuable', 'pick', 'need', 'city', 'important', 'might', 'clear', 'mean', 'newspaper', 'value', 'imply', 'help', 'instance', 'leap', 'investigative', 'use', 'im', 'customer', 'deep', 'build', 'sparked', 'kind', 'focus', 'price', 'make', 'middle', 'obvious', 'ozbilic', 'topic', 'time', 'going', 'yildirim', 'committee', 'country', 'experience', 'judicial', 'attempt', 'simple', 'thousand', 'lot', 'part', 'decision', 'try', 'provide', 'cover', 'said', 'support', 'productive', 'talking', 'former', 'forefront', 'critic', 'course', 'meeting', 'serve', 'information', 'july', 'binali', 'data', 'primary', 'convene', 'headed', 'insight', 'speculation', 'business', 'world', 'really', 'gaziantep', 'product', 'minister', 'inciting', 'know', 'like', 'working', 'others', 'ahead', 'thats', 'create', 'develop', 'company', 'whether', 'take', 'tried', 'idea', 'threehour', 'differentiate', 'accused', 'coup', '
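
A Python `set` has no guaranteed iteration order, so the printed vocabulary can differ between runs. A minimal sketch of a stable alternative, assuming a sorted list is acceptable; `toy_corpus` is hypothetical data standing in for `full_corpus`:

```python
# Toy corpus standing in for full_corpus (hypothetical data).
toy_corpus = ["history middle east", "deep learning history"]

# Union of the per-document token sets, sorted so the order is stable across runs.
vocabulary = sorted(set().union(*(doc.split() for doc in toy_corpus)))
print(vocabulary)
print(len(vocabulary))
```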

# **Using Scikit-learn TF-IDF Calculation (Built-In)**

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
import math

In [93]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(full_corpus)
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
print("TF-IDF Built-in DataFrame:")
print(tfidf_df)

TF-IDF Built-in DataFrame:
    accused    agreed     ahead       aim   already      also    amount  \
0  0.074951  0.074951  0.000000  0.000000  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.038388  0.038388  0.076775  0.038388  0.038388   

      anger    ankara    answer  ...      well      weve   whether  \
0  0.074951  0.074951  0.000000  ...  0.000000  0.000000  0.000000   
1  0.000000  0.000000  0.038388  ...  0.038388  0.038388  0.038388   

   widespread    within      wont   working     world      year  yildirim  
0    0.074951  0.000000  0.000000  0.000000  0.000000  0.074951  0.074951  
1    0.000000  0.038388  0.038388  0.076775  0.038388  0.000000  0.000000  

[2 rows x 231 columns]
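
With its default settings, `TfidfVectorizer` uses raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row. The sketch below reproduces that weighting in plain Python on a toy corpus. The function name `sklearn_style_tfidf` and the toy sentences are illustrative, and whitespace splitting is an assumption: the library's default tokenizer also ignores single-character tokens, so results on real text may differ slightly.

```python
import math
from collections import Counter

def sklearn_style_tfidf(corpus):
    """Raw-count TF, smoothed IDF, L2-normalised rows (one dict per document)."""
    docs = [doc.split() for doc in corpus]  # whitespace tokenisation (assumption)
    n_docs = len(docs)

    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in docs for word in set(tokens))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {word: math.log((1 + n_docs) / (1 + freq)) + 1 for word, freq in df.items()}

    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        weights = {word: count * idf[word] for word, count in counts.items()}
        # Scale the row to unit Euclidean length; an empty document stays empty.
        norm = math.sqrt(sum(value * value for value in weights.values()))
        rows.append({word: value / norm for word, value in weights.items()} if norm else {})
    return rows

toy_rows = sklearn_style_tfidf(["coup minister", "coup data"])
print(toy_rows)
```

Because every row is scaled to unit length, the values are not directly comparable to an unnormalised TF × IDF product.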


# **Implementing TF-IDF from Scratch**

In [105]:
def calculate_tf(full_corpus):
    #one dict of term frequencies per document
    rows = []

    #loop over documents in the corpus
    for document in full_corpus:
        words = document.split()

        #total number of words
        total_words = len(words)

        #calculating frequency of each word
        word_freq = {}
        for word in set(words):
            word_freq[word] = words.count(word) / total_words

        #one row per document
        rows.append(word_freq)

    #build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    #and set NaN (word absent from a document) to 0
    tf_df = pd.DataFrame(rows).fillna(0)

    return tf_df

In [106]:
tf_calculation = calculate_tf(full_corpus)
tf_calculation



| | anger | protect | iraq | resulted | need | city | newspaper | instance | investigative | sparked | ... | become | wont | thing | learning | possible | background | introduce | answer | user | relevant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | 0.008696 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.004274 | 0.004274 | 0.017094 | 0.004274 | 0.004274 | 0.004274 | 0.004274 | 0.004274 | 0.004274 | 0.012821 |
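
Term frequency here is the count of a word divided by the number of tokens in its document; a value of 0.008696 is what a single occurrence in a 115-token document would give (1/115), though the document length is inferred from the output rather than printed. A minimal sketch of the same calculation using `collections.Counter`, which counts all words in one pass instead of calling `words.count` once per word. The function name `term_frequencies` and the sample string are illustrative.

```python
from collections import Counter

def term_frequencies(document):
    """Return {word: count / total_tokens} for one whitespace-tokenised document."""
    tokens = document.split()
    counts = Counter(tokens)   # single pass over the tokens
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}

tf = term_frequencies("coup minister coup report")
print(tf)
```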


In [107]:
def calculate_idf(full_corpus):
    #total number of documents in corpus
    total_docs = len(full_corpus)

    idf_values = {}

    #loop over documents
    for document in full_corpus:
        words = set(document.split())

        #update idf_values dict {}
        for word in words:
            idf_values[word] = idf_values.get(word, 0) + 1

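    #calculating idf
    #note: `total_docs / 1 + freq` is parsed as `(total_docs / 1) + freq`,
    #so this evaluates log(total_docs + freq), not log(total_docs / (1 + freq))
    for word, freq in idf_values.items():
        idf_values[word] = np.log(total_docs / 1 + freq)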

    return idf_values

In [108]:
idf_calculation = calculate_idf(full_corpus)
idf_calculation

{'anger': 1.0986122886681098,
 'protect': 1.0986122886681098,
 'iraq': 1.0986122886681098,
 'resulted': 1.0986122886681098,
 'need': 1.0986122886681098,
 'city': 1.0986122886681098,
 'newspaper': 1.0986122886681098,
 'instance': 1.0986122886681098,
 'investigative': 1.0986122886681098,
 'sparked': 1.0986122886681098,
 'middle': 1.0986122886681098,
 'ozbilic': 1.0986122886681098,
 'yildirim': 1.0986122886681098,
 'committee': 1.0986122886681098,
 'country': 1.0986122886681098,
 'judicial': 1.0986122886681098,
 'attempt': 1.0986122886681098,
 'decision': 1.0986122886681098,
 'said': 1.0986122886681098,
 'support': 1.0986122886681098,
 'former': 1.0986122886681098,
 'critic': 1.0986122886681098,
 'course': 1.0986122886681098,
 'meeting': 1.0986122886681098,
 'information': 1.0986122886681098,
 'july': 1.0986122886681098,
 'binali': 1.0986122886681098,
 'convene': 1.0986122886681098,
 'headed': 1.0986122886681098,
 'speculation': 1.0986122886681098,
 'gaziantep': 1.0986122886681098,
 'mini
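
Every value printed above equals ln 3 ≈ 1.0986, which is what `np.log(total_docs / 1 + freq)` yields for two documents and a document frequency of 1, because division binds tighter than addition. The sketch below checks that arithmetic and shows one common alternative, the smoothed form ln((1 + n) / (1 + df)) + 1. The function name `smoothed_idf` and the toy corpus are illustrative, and choosing this particular formula is an assumption rather than the notebook's stated intent.

```python
import math

def smoothed_idf(corpus):
    """Smoothed IDF: ln((1 + n_docs) / (1 + df)) + 1 for every word in the corpus."""
    n_docs = len(corpus)
    doc_freq = {}
    for document in corpus:
        # Count each word once per document.
        for word in set(document.split()):
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {word: math.log((1 + n_docs) / (1 + freq)) + 1
            for word, freq in doc_freq.items()}

# Operator precedence: 2 / 1 + 1 is (2 / 1) + 1 = 3, so the log is ln 3.
precedence_value = math.log(2 / 1 + 1)

idf = smoothed_idf(["coup minister", "coup data"])
print(precedence_value)
print(idf)
```

With the smoothed form, a word that occurs in every document gets the lowest weight (exactly 1), and rarer words get progressively higher weights.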

In [109]:
def calculate_tfidf(tf_df, idf_values):
    #put the IDF values in the same column order as the TF dataframe
    idf_series = pd.Series(idf_values).reindex(tf_df.columns)

    #multiply each TF column by its word's IDF to get the TF-IDF value
    tfidf_df = tf_df.mul(idf_series, axis='columns')

    return tfidf_df

In [111]:
tfidf_calculation = calculate_tfidf(tf_calculation,idf_calculation)
tfidf_calculation

| | anger | protect | iraq | resulted | need | city | newspaper | instance | investigative | sparked | ... | become | wont | thing | learning | possible | background | introduce | answer | user | relevant |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | 0.009553 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.004695 | 0.004695 | 0.01878 | 0.004695 | 0.004695 | 0.004695 | 0.004695 | 0.004695 | 0.004695 | 0.014085 |
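
These values are the plain product TF × IDF with no row normalisation, which is one reason they sit on a different scale from the scikit-learn matrix. A quick arithmetic check of the printed numbers, assuming document lengths of 115 and 234 tokens and word counts of 1, 4, and 3; these figures are inferred from the TF output and are not printed anywhere in the notebook:

```python
import math

# IDF as evaluated by calculate_idf for a word found in one of two documents: ln(2 + 1).
idf_value = math.log(2 / 1 + 1)

# (count, assumed document length) pairs inferred from the printed TF values.
checks = {
    "anger (doc 0)": (1, 115),
    "become (doc 1)": (1, 234),
    "thing (doc 1)": (4, 234),
    "relevant (doc 1)": (3, 234),
}

# TF-IDF = (count / document length) * IDF, rounded like the displayed output.
products = {name: round(count / length * idf_value, 6)
            for name, (count, length) in checks.items()}
print(products)
```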
