# 1. Business Understanding:

We will build a simple Natural Language Processing (NLP) application on a short piece of text.

The goal is to reinforce the topics listed below.

> The topics are as follows:

The first goal is to carry out data preprocessing steps on the text.

These preprocessing steps can include:

-- Removing extra whitespace

-- Converting between uppercase and lowercase

-- Removing punctuation

-- Removing special characters

-- Correcting spelling mistakes

-- Cleaning up HTML and URLs
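
HTML and URL cleanup is not shown in the cells that follow, so here is a minimal sketch of what that step could look like using only the standard library. The helper name `strip_html_and_urls` and the sample string are assumptions made for illustration; real-world HTML is usually better handled by a dedicated parser.

```python
import html
import re

def strip_html_and_urls(text: str) -> str:
    """Remove HTML tags and URLs, then decode HTML entities (illustrative helper)."""
    text = re.sub(r"<[^>]+>", " ", text)                # drop tags such as <p> or </a>
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # drop URLs
    text = html.unescape(text)                          # decode entities, e.g. &amp; -> &
    return re.sub(r"\s+", " ", text).strip()            # collapse leftover whitespace

sample = "<p>Read   more at https://example.com/nlp &amp; enjoy!</p>"
cleaned = strip_html_and_urls(sample)
print(cleaned)
```

Tags are removed before URLs so that a URL sitting inside an attribute does not swallow the closing tag along with it.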


> Important Notes:

<mark><em>Removing Stopwords</em></mark> → Stopwords are words that carry very little meaning in a text and therefore have a very limited effect on the analysis, such as prepositions and conjunctions. Examples include "the", "with", "on", and "at".
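
As a rough illustration of stopword removal, the sketch below filters tokens against a small hand-written set. The `STOPWORDS` set and the `remove_stopwords` helper are assumptions made for this example; in practice a fuller list is normally used, such as the one NLTK provides through `nltk.corpus.stopwords` after downloading the `stopwords` resource.

```python
# A tiny hand-written stopword set, used only for illustration.
STOPWORDS = {"the", "with", "on", "at", "a", "an", "of", "in", "and", "to"}

def remove_stopwords(tokens):
    """Return the tokens that are not in STOPWORDS (case-insensitive)."""
    return [tok for tok in tokens if tok.lower() not in STOPWORDS]

tokens = "She gave a keynote speech at the conference in Paris".split()
filtered = remove_stopwords(tokens)
print(filtered)
```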


One of the next steps is

> <mark><em>Tokenization</em></mark>: The process of splitting text into smaller pieces.

The text can be split at different levels of granularity.

It can be split word by word, sentence by sentence, or character by character.
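
The three levels of tokenization can be sketched with plain Python. This is a deliberately naive version: the sentence splitter below would break on abbreviations such as "Dr.", which is one reason tokenizers like NLTK's `sent_tokenize` and `word_tokenize` exist.

```python
import re

text = "AI is not just about machines. It is about people."

# Word level: split on whitespace (punctuation stays attached to the words).
word_tokens = text.split()

# Sentence level: naive split on whitespace that follows ., ! or ?
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

# Character level: every character, including spaces, becomes a token.
char_tokens = list(text)

print(word_tokens)
print(sentence_tokens)
print(char_tokens[:5])
```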


Two other concepts we come across are <mark><b>Lemmatization</b></mark> (reducing a word to its dictionary form, the lemma) and <mark><b>Stemming</b></mark> (stripping affixes to reach a stem):

Turkish grammar has the related concepts of root (kök) and stem (gövde). Going back to our primary or middle school years:

=> A word that has taken no derivational suffix and has not been compounded with another word is a root.

For example, sev ("to love")

=> A word that has taken a derivational suffix or has been compounded with another word is a stem.

For example, sevgi ("love", the noun)
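
To make the difference concrete for English, here is a toy sketch. Neither function is a real algorithm from a library: `toy_stem` blindly cuts a few suffixes, and `toy_lemmatize` looks words up in a three-entry dictionary invented for this example. Library implementations (for instance the `stem()` and `lemmatize()` methods on TextBlob's `Word`) are far more complete.

```python
# Toy lemma dictionary (hypothetical): maps inflected forms to dictionary forms.
LEMMAS = {"went": "go", "studies": "study", "talked": "talk"}
SUFFIXES = ("ing", "ies", "ed", "es", "s")  # checked in this order

def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "went", "talked"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

Stemming turns "studies" into the non-word "stud" and cannot do anything with the irregular form "went", while the dictionary lookup returns "study" and "go".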


In later stages we will also cover:

- Named Entity Recognition (NER) → Detecting person names, place names, organizations, and so on

- Part-of-Speech (POS) Tagging → Determining the word class of each word (noun, verb, adjective, adverb, ...)

- Word Frequency Count → Finding the most frequent words

- Text Visualization → Visualizing words with techniques or tools such as word clouds and bar charts
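
Of the items above, word frequency counting is the simplest to sketch with the standard library. The token list below is made up for illustration; `collections.Counter` does the counting.

```python
from collections import Counter

tokens = "data privacy and data ethics and data tools".split()
counts = Counter(tokens)

# The two most frequent words, as (word, count) pairs.
top_two = counts.most_common(2)

# Words that occur exactly once, in order of first appearance.
rare_words = [word for word, count in counts.items() if count == 1]

print(top_two)
print(rare_words)
```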



In [1]:
# import required libraries

# for linear algebra
import numpy as np

# for data manipulation
import pandas as pd

# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# for NLP
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')


# for word preprocessing
from textblob import Word, TextBlob

# for word visualization
from wordcloud import WordCloud 

# set warning options 
from warnings import filterwarnings
filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Asus\AppData\Roaming\nltk_data...


# 2. Data Understanding:
This section analyzes a text about a conference talk, which I think will be useful for people who, like me, are new to NLP (Natural Language Processing).

In short, the text describes a keynote by Dr. Emily Watson in which she stresses ethical AI and data privacy and argues that, thanks to open-source NLP libraries, large financial resources are no longer needed to build high-quality models. After the talk, attendees approached her to discuss future collaboration opportunities, and her summary of the speech drew considerable attention on Twitter.

In [None]:
example_str = '''On January 3rd, 2023, Dr. Emily Watson, a senior data scientist at GreenAI Inc., gave a keynote speech at the International Conference on Artificial Intelligence in Paris, France. During her talk, she emphasized the importance of ethical AI and data privacy, citing recent cases of misuse in various industries.

She mentioned that over 3.2 million users were affected by a data breach last year, resulting in damages estimated at $12.5 million. Furthermore, she highlighted the role of open-source libraries, such as spaCy and NLTK, in democratizing access to natural language processing tools. According to her, students and researchers can now build high-quality NLP models without needing large financial resources.

"AI is not just about machines," she said, "it's about how we interact with technology in a human-centered way." After the session, attendees from universities like Stanford, MIT, and Oxford approached her to discuss future collaboration opportunities.

At 5:45 PM, she posted a summary of her speech on Twitter, receiving over 8,000 likes and 1,200 retweets within a few hours. Her tweet included hashtags like #AIethics, #DataPrivacy, and #NLPtools.

The event concluded with a panel discussion moderated by Mr. John Lee, a journalist from TechWorld Weekly, who asked, “How can governments regulate AI without stifling innovation?''' 

# 3. Data Preparation and Preprocessing:
As mentioned in the Business Understanding section, we need to bring our data into a format that the machine can work with.

It is not for nothing that people say machines only understand 0s and 1s :)

In [4]:
example_str_len = len(example_str)
print(f"Text length before preprocessing: {example_str_len}")

Text length before preprocessing: 1353


In [None]:
# 3.1. Convert all text to lowercase
lower_example_str = example_str.lower()
print("Converted String:\n", lower_example_str)
print("*"*100,"\n")
len_lower_example_str = len(lower_example_str)
print(f"Length of Lower Example: {len_lower_example_str}")

Converted String:
 on january 3rd, 2023, dr. emily watson, a senior data scientist at greenai inc., gave a keynote speech at the international conference on artificial intelligence in paris, france. during her talk, she emphasized the importance of ethical ai and data privacy, citing recent cases of misuse in various industries.

she mentioned that over 3.2 million users were affected by a data breach last year, resulting in damages estimated at $12.5 million. furthermore, she highlighted the role of open-source libraries, such as spacy and nltk, in democratizing access to natural language processing tools. according to her, students and researchers can now build high-quality nlp models without needing large financial resources.

"ai is not just about machines," she said, "it’s about how we interact with technology in a human-centered way." after the session, attendees from universities like stanford, mit, and oxford approached her to discuss future collaboration opportunities.

at 5:4

In [None]:
# 3.2. Remove punctuation
from string import punctuation

print(f"Punctuations: {punctuation}")


#trimmed_str = lower_example_str.strip()
#trimmed_str

Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


"""

Some Basic RegEx
<mark>regex (regular expressions)</mark>

    ^ => starts with

    $ => ends with
    
    \w => Matches any word character
        (letters a-z and A-Z, digits 0-9, and the underscore _ character)
    
    \W => Matches any character that is NOT a word character
    
    \s => Matches a whitespace character (space, tab, newline)
    
    [] => A set of characters, for example "[a-m]" matches any character from a to m; a leading ^ inside the brackets negates the set, so "[^a-m]" matches any character NOT in that range
"""

In [19]:
import re 

# Delete every character that is neither a word character (\w) nor whitespace (\s)
non_punctuation_str = re.sub(r"[^\w\s]", "", lower_example_str)
print(f"String without punctuation marks:\n\n {non_punctuation_str}")

String without punctuation marks:

 on january 3rd 2023 dr emily watson a senior data scientist at greenai inc gave a keynote speech at the international conference on artificial intelligence in paris france during her talk she emphasized the importance of ethical ai and data privacy citing recent cases of misuse in various industries

she mentioned that over 32 million users were affected by a data breach last year resulting in damages estimated at 125 million furthermore she highlighted the role of opensource libraries such as spacy and nltk in democratizing access to natural language processing tools according to her students and researchers can now build highquality nlp models without needing large financial resources

ai is not just about machines she said its about how we interact with technology in a humancentered way after the session attendees from universities like stanford mit and oxford approached her to discuss future collaboration opportunities
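
The pattern `[^\w\s]` used above combines several constructs from the regex notes. The short sketch below tries each construct on a made-up sample string; the variable names are chosen only for this example.

```python
import re

sample = "Dr. Watson posted at 5:45 PM!"

starts_with_dr = bool(re.search(r"^Dr", sample))  # ^ anchors the match at the start
ends_with_bang = bool(re.search(r"!$", sample))   # $ anchors the match at the end
words = re.findall(r"\w+", sample)                # runs of word characters
punct = re.findall(r"[^\w\s]", sample)            # neither word character nor whitespace
a_to_m = re.findall(r"[a-m]", "nlp tools")        # lowercase letters from a to m

print(starts_with_dr, ends_with_bang)
print(words)
print(punct)
print(a_to_m)
```

Inside square brackets a leading `^` negates the set, which is why `[^\w\s]` selects exactly the characters that the punctuation step deletes.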

In [20]:
len_non_puctuation_str = len(non_punctuation_str)
print(f"Length of string without punctuation marks: {len_non_puctuation_str}")

Length of string without punctuation marks: 1298
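
One side effect of deleting punctuation outright is visible in the output: "3.2" becomes "32" and "open-source" becomes "opensource". The sketch below compares that behavior with one possible alternative, replacing punctuation with a space. The alternative is an assumption for illustration, not something this notebook does, and whether it is preferable depends on the downstream task.

```python
import re

sample = "over 3.2 million users rely on open-source, high-quality tools"

# Approach used in the notebook: delete punctuation outright.
deleted = re.sub(r"[^\w\s]", "", sample)

# Alternative (an assumption, not part of the notebook): swap punctuation
# for a space, then collapse the repeated whitespace that results.
spaced = re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", sample)).strip()

print(deleted)
print(spaced)
```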


In [None]:
# 3.3. Remove numbers
numberless_str = re.sub(r"\d+", "", non_punctuation_str)
print("Text with punctuation and numbers removed: \n")
print(numberless_str)
print("*"*150,"\n")
len_numberless_str = len(numberless_str)
print(f"Length of Numberless String: {len_numberless_str}")

Text with punctuation and numbers removed: 

on january rd  dr emily watson a senior data scientist at greenai inc gave a keynote speech at the international conference on artificial intelligence in paris france during her talk she emphasized the importance of ethical ai and data privacy citing recent cases of misuse in various industries

she mentioned that over  million users were affected by a data breach last year resulting in damages estimated at  million furthermore she highlighted the role of opensource libraries such as spacy and nltk in democratizing access to natural language processing tools according to her students and researchers can now build highquality nlp models without needing large financial resources

ai is not just about machines she said its about how we interact with technology in a humancentered way after the session attendees from universities like stanford mit and oxford approached her to discuss future coll

In [None]:
# 3.4. Remove extra whitespace
without_extra_spaces_str = re.sub(r'\s+', ' ', numberless_str)
print(f"Without Extra Spaces String:\n\n {without_extra_spaces_str}")

print("*"*150)
len_without_extra_spaces = len(without_extra_spaces_str)
print(f"Length of Without Extra Spaces String: {len_without_extra_spaces}")

Without Extra Spaces String:

 on january rd dr emily watson a senior data scientist at greenai inc gave a keynote speech at the international conference on artificial intelligence in paris france during her talk she emphasized the importance of ethical ai and data privacy citing recent cases of misuse in various industries she mentioned that over million users were affected by a data breach last year resulting in damages estimated at million furthermore she highlighted the role of opensource libraries such as spacy and nltk in democratizing access to natural language processing tools according to her students and researchers can now build highquality nlp models without needing large financial resources ai is not just about machines she said its about how we interact with technology in a humancentered way after the session attendees from universities like stanford mit and oxford approached her to discuss future collaboration opportunities at pm she posted a summary of her speech on twi

In [None]:
# Strip leading and trailing whitespace, then check the length
without_extra_spaces_str = without_extra_spaces_str.strip()
print(len(without_extra_spaces_str))

1267


In [None]:
# Split the text into words (whitespace tokenization)
splitted_str = without_extra_spaces_str.split()
print(f"Split String:\n\n {splitted_str}")

len_splitted_str = len(splitted_str)

Split String:

 ['on', 'january', 'rd', 'dr', 'emily', 'watson', 'a', 'senior', 'data', 'scientist', 'at', 'greenai', 'inc', 'gave', 'a', 'keynote', 'speech', 'at', 'the', 'international', 'conference', 'on', 'artificial', 'intelligence', 'in', 'paris', 'france', 'during', 'her', 'talk', 'she', 'emphasized', 'the', 'importance', 'of', 'ethical', 'ai', 'and', 'data', 'privacy', 'citing', 'recent', 'cases', 'of', 'misuse', 'in', 'various', 'industries', 'she', 'mentioned', 'that', 'over', 'million', 'users', 'were', 'affected', 'by', 'a', 'data', 'breach', 'last', 'year', 'resulting', 'in', 'damages', 'estimated', 'at', 'million', 'furthermore', 'she', 'highlighted', 'the', 'role', 'of', 'opensource', 'libraries', 'such', 'as', 'spacy', 'and', 'nltk', 'in', 'democratizing', 'access', 'to', 'natural', 'language', 'processing', 'tools', 'according', 'to', 'her', 'students', 'and', 'researchers', 'can', 'now', 'build', 'highquality', 'nlp', 'models', 'without', 'needing', 'large', 'finan
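
The individual cells so far can be chained into a single helper. The sketch below reuses the same regular expressions in the same order as steps 3.1 to 3.4; the function name `preprocess` and the shortened sample sentence are assumptions for illustration.

```python
import re

def preprocess(text: str) -> list:
    """Lowercase, drop punctuation and digits, tidy whitespace, split into words."""
    text = text.lower()                   # 3.1 lowercase
    text = re.sub(r"[^\w\s]", "", text)   # 3.2 remove punctuation
    text = re.sub(r"\d+", "", text)       # 3.3 remove numbers
    text = re.sub(r"\s+", " ", text)      # 3.4 collapse extra whitespace
    return text.strip().split()           # trim the ends, then split on spaces

tokens = preprocess("On January 3rd, 2023, Dr. Emily Watson gave a keynote speech.")
print(tokens)
```

Wrapping the steps in one function makes it easy to apply the same cleaning to other texts and to test each change in isolation.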

In [None]:
# 3.6. Find rare words
from collections import Counter

word_counts = 