# python libraries for training dataset
- numpy (np): For numerical operations and array handling.
- tensorflow (tf): For building and training neural network models.
- train_test_split: For splitting datasets into training and testing sets.
- Tokenizer: For converting text data into numerical sequences.
- pad_sequences: For padding sequences to ensure uniform length.
- Counter: For counting element occurrences in lists.

In [None]:
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from collections import Counter

# get dataset from .txt file
- load_sentences function: This function takes a file path as input, reads the file, and stores each line (stripped of leading and trailing whitespace) as an element in a list. It then returns the list of sentences.
- data: The code loads sentences from a file named 'data_ki.txt' using the defined function and stores them in a NumPy array.

In [5]:
# function to load sentences from a text file
def load_sentences(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        sentences = [line.strip() for line in file]
    return sentences

# load data to array
data = load_sentences('data_ki.txt')
data = np.array(data)

# print the first 5 sentences
print("Example Sentences:")
print(data[:5])

Example Sentences:
['Mahallenin iklimi, karasal iklim et ki alanı içerisindedir'
 'Sonra öğrenir ki gün ışığında yürüyebilen vampirler ölüleri diriltebilir'
 'Köyün iklimi, Karadeniz iklimi et ki alanı içerisindedir'
 "Ne yazık ki Muhammed'in ailesi gitarını buldu ve parçaladı"
 "Yayla mahallesi Kocaali'nin en es ki yerleşim alanlarındandır"]


# generate labels for the sentences based on their position
- Labels generation: A list comprehension creates one label per sentence. Sentences at even indices are labeled 1 (suffix "ki") and those at odd indices are labeled 0 (conjunction "ki"), which assumes the file alternates the two kinds of sentence.
- Conversion to NumPy array: It converts the generated labels list into a NumPy array for further processing.

In [7]:
# label the sentences (suffix ki's are labeled as 1, conjunction ki's are labeled as 0)
labels = [1 if i % 2 == 0 else 0 for i in range(len(data))]
labels = np.array(labels)
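
The position-based labeling above can be sanity-checked before training. The following is a minimal sketch on a hypothetical dataset size of 10 (`n_sentences`, `labels_demo`, and `labels_vectorized` are illustrative names, not part of the notebook). It shows a vectorized equivalent of the list comprehension and confirms that the two classes come out balanced.

```python
import numpy as np

n_sentences = 10  # hypothetical small dataset size, for illustration only

# Same rule as the notebook: even index -> 1 (suffix), odd index -> 0 (conjunction).
labels_demo = np.array([1 if i % 2 == 0 else 0 for i in range(n_sentences)])

# Vectorized equivalent of the list comprehension.
labels_vectorized = (np.arange(n_sentences) % 2 == 0).astype(int)

# Count how many sentences fall into each class (index 0 -> label 0, index 1 -> label 1).
counts = np.bincount(labels_demo)
print(labels_demo)
print(counts)
```

If the source file does not strictly alternate the two sentence types, this scheme would assign wrong labels, so it is worth spot-checking a few sentences against their labels.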

# count the occurrences of unique words
- counter_word function: This function takes a collection of text data (text_col) as input. It initializes a Counter object to store word counts. Then, it iterates over each text in the collection, splits it into words, and updates the count for each word in the Counter object. Finally, it returns the Counter object containing the word counts.
- counter: The function is called with the data array of sentences to count the occurrences of each unique word in the dataset.

In [8]:
# count unique words
def counter_word(text_col):
    count = Counter()
    for text in text_col:
        for word in text.split():
            count[word] += 1
    return count

counter = counter_word(data)
print(len(counter))
counter.most_common(5)

3949


[('ki', 1000),
 ('et', 131),
 ('alanı', 129),
 ('içerisindedir', 128),
 ('iklimi,', 124)]
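
The output above lists 'iklimi,' with its trailing comma, which means whitespace splitting counts punctuated and unpunctuated forms as different words. As a rough sketch of one way to merge them, the hypothetical helper `counter_word_normalized` below strips ASCII punctuation and lowercases each token before counting. It is an illustration, not the notebook's method.

```python
from collections import Counter
import string

def counter_word_normalized(text_col):
    """Count words after stripping surrounding ASCII punctuation and lowercasing."""
    count = Counter()
    for text in text_col:
        for word in text.split():
            # Note: str.lower() is not Turkish-locale aware (e.g. 'I' becomes 'i', not 'ı').
            token = word.strip(string.punctuation).lower()
            if token:
                count[token] += 1
    return count

# One sentence taken from the example output above.
sample = ["Köyün iklimi, Karadeniz iklimi et ki alanı içerisindedir"]

raw = Counter(word for text in sample for word in text.split())
normalized = counter_word_normalized(sample)

print(raw["iklimi"], raw["iklimi,"])  # counted separately by whitespace splitting
print(normalized["iklimi"])           # merged after normalization
```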

# set parameters for preprocessing the data
- max_words: The maximum number of words to tokenize. It is set to the number of unique words in the dataset, taken from the counter object calculated above.
- max_len: It sets the maximum length of sentences after tokenization. In this case, sentences longer than 7 words will be truncated, and shorter sentences will be padded with zeros to match this length.

In [9]:
# preprocess the data
max_words = len(counter)  # maximum number of words to tokenize
max_len = 7  # maximum length of sentences

max_words

3949
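
Whether max_len = 7 is a good fit depends on how long the sentences are. The sketch below measures whitespace word counts for the five example sentences printed earlier (`sample_sentences` and `lengths` are illustrative names). Whitespace counts only approximate what the tokenizer will produce, so treat the numbers as a rough guide.

```python
import numpy as np

# The five example sentences shown in the notebook output.
sample_sentences = [
    "Mahallenin iklimi, karasal iklim et ki alanı içerisindedir",
    "Sonra öğrenir ki gün ışığında yürüyebilen vampirler ölüleri diriltebilir",
    "Köyün iklimi, Karadeniz iklimi et ki alanı içerisindedir",
    "Ne yazık ki Muhammed'in ailesi gitarını buldu ve parçaladı",
    "Yayla mahallesi Kocaali'nin en es ki yerleşim alanlarındandır",
]

# Word count per sentence, using plain whitespace splitting.
lengths = np.array([len(s.split()) for s in sample_sentences])
print(lengths)

# Number of these sentences that exceed the notebook's max_len of 7.
too_long = int((lengths > 7).sum())
print(too_long)
```

All five of these examples are longer than 7 words, so each of them loses at least one word to truncation.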

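# tokenize all words in the data and pad sequences
- Tokenizer initialization: The Tokenizer object is initialized with num_words set to max_words, which caps the vocabulary used when converting text to sequences. Keras keeps only the num_words - 1 most frequent words, and the Tokenizer lowercases text and strips punctuation by default, so its vocabulary can differ slightly from the whitespace-based counter above.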
- Fitting tokenizer on data: The Tokenizer's fit_on_texts method is called with the data array to update the tokenizer's internal vocabulary based on the text data.
- Converting texts to sequences: The texts_to_sequences method of the tokenizer is used to convert each text in the data to a sequence of integers based on the tokenizer's vocabulary. Each word in the text is replaced by its corresponding integer index in the tokenizer's word index.
- Padding sequences: The pad_sequences function ensures that all sequences have the same length (max_len). With the default settings, sequences longer than max_len are truncated at the beginning and shorter ones are padded with zeros at the beginning.

In [10]:
# tokenize all words in the data
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data)
sequences = tokenizer.texts_to_sequences(data)
padded_sequences = pad_sequences(sequences, maxlen=max_len)
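
To make the default padding and truncation behavior concrete, here is a small NumPy-only sketch that mimics it. The function `pad_pre` is a hypothetical stand-in written for illustration, not the Keras implementation, and the integer sequences are made up.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad and truncate at the start."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]          # 'pre' truncation keeps the last maxlen items
        out[row, -len(kept):] = kept  # 'pre' padding leaves the zeros in front
    return out

# A short sequence (3 tokens) and a long one (9 tokens), with made-up word indices.
demo = pad_pre([[5, 3, 1], [9, 8, 1, 7, 6, 5, 4, 3, 2]], maxlen=7)
print(demo)
```

Because truncation drops words from the start, a "ki" that appears among the first words of a long sentence can be cut off before the model sees it.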

# split the data into training and test sets
- Calculating split size: It calculates the index (train_size) where the split between training and test data should occur. This index corresponds to 80% of the total data length.
- Splitting sequences and labels: It splits the padded sequences (padded_sequences) and labels (labels) arrays into training and test sets using array slicing. The first train_size elements are assigned to the training set, while the remaining elements are assigned to the test set.
- Assignment: The training sequences and labels are assigned to train_sentences and train_labeled variables, respectively, while the test sequences and labels are assigned to test_sentences and test_labeled variables.

In [11]:
# split data into training and test sentences (80% of data as training, 20% of data as test)
train_size = int(len(data) * 0.8)
train_sentences = padded_sequences[:train_size]
train_labeled = labels[:train_size]
test_sentences = padded_sequences[train_size:]
test_labeled = labels[train_size:]
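
The notebook imports train_test_split but splits by slicing, which keeps the file order. If the file is ordered in some way (for example by topic), a shuffled split may give a more representative test set. The sketch below shows one way to do that with NumPy alone, using made-up stand-in arrays `X` and `y` and an arbitrary seed.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # the seed value is an arbitrary choice

# Stand-ins for padded_sequences and labels: 10 rows, alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([1, 0] * 5)

# Shuffle row indices once, then cut at 80%.
order = rng.permutation(len(X))
split = int(len(X) * 0.8)
train_idx, test_idx = order[:split], order[split:]

# Index both arrays with the same indices so rows and labels stay paired.
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```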

# define a neural network model using TensorFlow's Keras API
- Model architecture: It defines a sequential model using tf.keras.Sequential(), which allows for building models layer by layer. The model consists of the following layers:
    - Embedding layer: Converts integer-encoded words into dense vectors of fixed size (64 dimensions in this case). It expects sequences of integers as input and has a vocabulary size of max_words. The input_length parameter specifies the length of input sequences.
    - LSTM layer: Long Short-Term Memory (LSTM) layer with 64 units and a ReLU activation function in place of the Keras default, tanh. LSTMs are a type of recurrent neural network (RNN) capable of learning long-term dependencies in sequence data.
    - Dense layer: Output layer with a single neuron and sigmoid activation function, suitable for binary classification tasks.
- Compilation: It compiles the model using the compile() method, specifying the loss function (binary_crossentropy for binary classification), optimizer (adam), and evaluation metric (accuracy).

In [12]:
# build model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(max_words, 64, input_length=max_len),
    tf.keras.layers.LSTM(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
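
As a rough size check for this architecture, the trainable parameter counts can be worked out by hand. The arithmetic below assumes the standard Keras layer definitions (an embedding matrix of vocabulary size by embedding dimension, four LSTM gates each with input weights, recurrent weights, and a bias, and a dense layer with one bias). The variable names are illustrative.

```python
vocab_size = 3949  # max_words, as printed in the notebook
embed_dim = 64     # embedding vector size
lstm_units = 64    # LSTM hidden units

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# Dense output: one weight per LSTM unit plus one bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the embedding layer, which is worth keeping in mind given that the dataset has only 1000 sentences.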

# train the model on the training data and validate it after each epoch
- Training the model: The fit() method is called on the model with the following parameters:
    - train_sentences and train_labeled: The training sequences and their corresponding labels.
    - epochs=10: The number of training epochs, set to 10 in this case, meaning the entire training dataset will be passed forward and backward through the neural network 10 times during training.
    - validation_data=(test_sentences, test_labeled): The validation data to evaluate the model's performance after each epoch. It consists of the test sequences and their labels.
- History object: The fit() method returns a History object (history) containing information about the training process, such as loss and accuracy metrics recorded at each epoch.

In [13]:
history = model.fit(train_sentences, train_labeled, epochs=10, validation_data=(test_sentences, test_labeled))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


# evaluate the trained model on the test data to calculate its loss and accuracy
- Model evaluation: The evaluate() method is called on the model with the test sequences (test_sentences) and their corresponding labels (test_labeled) as input.
- Test loss and accuracy: The method returns the test loss and accuracy, which are assigned to the variables test_loss and test_acc, respectively.

In [14]:
test_loss, test_acc = model.evaluate(test_sentences, test_labeled)
print("Test Accuracy:", test_acc)

Test Accuracy: 0.7900000214576721
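
With 1000 sentences and an 80/20 split, the test set holds 200 sentences, so an accuracy of 0.79 corresponds to about 158 correct predictions. The sketch below shows how accuracy and the four confusion-matrix counts follow from sigmoid scores and a 0.5 threshold. The `scores` and `y_true` arrays are hypothetical values chosen for illustration, not results from this model.

```python
import numpy as np

# Correct predictions implied by the reported accuracy on 200 test sentences.
correct_on_test = round(0.79 * 200)
print(correct_on_test)

# Hypothetical sigmoid outputs and true labels, for illustration only.
scores = np.array([0.91, 0.12, 0.55, 0.40])
y_true = np.array([1, 0, 0, 1])

# Scores at or above the threshold are predicted as class 1 (suffix).
y_pred = (scores >= 0.5).astype(int)
accuracy = float((y_pred == y_true).mean())

# Confusion-matrix counts, which show which kind of error dominates.
tp = int(((y_pred == 1) & (y_true == 1)).sum())
tn = int(((y_pred == 0) & (y_true == 0)).sum())
fp = int(((y_pred == 1) & (y_true == 0)).sum())
fn = int(((y_pred == 0) & (y_true == 1)).sum())
print(accuracy, tp, tn, fp, fn)
```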


# analyze sentences and make predictions using the trained model
- analyze_sentence function: This function takes a sentence as input and splits it into words. Any word of at least three characters that ends in "ki" is split into its root and the "ki" ending, separated by a space; other words are kept unchanged. It then joins the words back into a sentence and returns it.
- predict function: This function takes a sentence and an optional threshold (default is 0.5) as input. It tokenizes and pads the sentence, predicts the label and score using the trained model, and returns the prediction label along with the prediction score.
- Sentences for prediction: A list of sample sentences is provided for prediction.
- Prediction loop: Each sentence from the list is passed through the analyze_sentence function to preprocess it and then through the predict function for prediction. The prediction label and score are printed for each sentence.
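
The word-splitting rule described above can be tried on single words to see its edge cases. The helper `split_ki` below is a hypothetical one-word version of the same rule, written for illustration. The test words are chosen to show where the rule applies and where it does not.

```python
def split_ki(word):
    """Separate a trailing 'ki' from words of three or more characters."""
    if len(word) >= 3 and word.endswith("ki"):
        return word[:-2] + " " + word[-2:]
    return word

# Edge cases: attached ki, a standalone 'ki', the numeral 'iki', trailing punctuation.
for w in ["evdeki", "yazıkki", "ki", "iki", "evdeki."]:
    print(repr(w), "->", repr(split_ki(w)))
```

The rule splits every qualifying word the same way, whether the "ki" is a suffix or a conjunction, and it also splits unrelated words such as "iki". A word followed directly by punctuation is left untouched because it no longer ends in "ki".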

In [15]:
# make predictions
def analyze_sentence(sentence):
    words = sentence.split()
    analyzed_words = []

    for word in words:
        if len(word) >= 3 and (word.endswith("ki")):
            suffix = word[-2:]
            root = word[:-2]
            analyzed_words.append(root + " " + suffix)
        else:
            analyzed_words.append(word)

    analyzed_sentence = " ".join(analyzed_words)
    return analyzed_sentence

def predict(sentence, threshold=0.5):
    sequence = tokenizer.texts_to_sequences([sentence])
    padded_sequence = pad_sequences(sequence, maxlen=max_len)
    prediction_score = model.predict(padded_sequence)[0][0]
    if prediction_score >= threshold:
        return "'ki' is suffix so this sentence is true written", prediction_score
    else:
        return "'ki' is conjunction so this sentence is false written", prediction_score

sentences = ["Dışarıya çıkacaktıki çantasını unuttuğunu hatırladı.", "Alışveriş yapamadığım için evdeki malzemelerle yetindim.", "Fırındaki kurabiyeleri çıkartmayı unutmuş.",
             "Yüzü kitaplardaki tasvirden farklıydı.", "Buradaki elmaları kim yedi.", "Atatürk odaya teşrif etmiştiki etrafı sessizlik sardı.",
             "Ne yazıkki istediği tatili yapamadı.", "Olayın sonuçları önümüzdeki yıllarda etkisini gösterdi.", "Öyleki hala buralarda dolaşmaktadır.",
             "Öyle yorulmuşki yerinden doğrulamadı."]

for sentence in sentences:
    prediction_label, prediction_score = predict(analyze_sentence(sentence))
    print("Prediction Label:", prediction_label)
    print("Prediction Score:", prediction_score)

Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 0.008117323
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 0.00013546406
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 0.008117323
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 0.008117323
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 0.0037590903
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 0.0034737624
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 1.2593313e-12
Prediction Label: 'ki' is suffix so this sentence is true written
Prediction Score: 0.98599774
Prediction Label: 'ki' is conjunction so this sentence is false written
Prediction Score: 4.4509605e-08
Prediction Label: 'ki' is conjunction so this sentence is false written
Predictio