About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks whether the news article is real or fake
           1: fake news
           0: real news






Importing the Dependencies

In [53]:
import numpy as np
import pandas as pd
import re                        # regular expressions: find patterns in text (a specific word, a number, etc.) and manipulate the text based on them
from nltk.corpus import stopwords  # stopwords are common words that carry little meaning on their own in NLP (e.g. "the", "is", "in"); used here to filter them out
from nltk.stem.porter import PorterStemmer    # PorterStemmer is an algorithm that reduces words to their stem; for example, "running" is reduced to "run"
from sklearn.feature_extraction.text import TfidfVectorizer   # converts text to numeric features with TF-IDF (Term Frequency-Inverse Document Frequency), weighting a term by how often it occurs in a document and how rare it is across documents
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [54]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [55]:
# printing the English stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing

In [60]:
# loading the dataset into a pandas DataFrame

news_dataset = pd.read_csv('/content/train.csv', encoding='iso-8859-1', on_bad_lines='skip')


In [61]:
news_dataset.shape

(34534, 5)

In [62]:
# print the first 8 rows of the DataFrame
news_dataset.head(8)

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didnât Even See Comeyâs...,Darrell Lucus,House Dem Aide: We Didnât Even See Comeyâs...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
5,5,Jackie Mason: Hollywood Would Love Trump if He...,Daniel Nussbaum,"In these trying times, Jackie Mason is the Voi...",0
6,6,Life: Life Of Luxury: Elton Johnâs 6 Favorit...,,Ever wonder how Britainâs most iconic pop pi...,1
7,7,BenoÃ®t Hamon Wins French Socialist Partyâs ...,Alissa J. Rubin,"PARIS â France chose an idealistic, tradi...",0


In [63]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0,0
id,0
title,930
author,3254
text,90
label,34


In [64]:
# replacing the null values with empty strings (this also turns the 34 missing labels into '')
news_dataset = news_dataset.fillna('')
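
A blanket fillna('') also touches the label column, which is one likely reason the labels later print as strings. The sketch below uses a hypothetical three-row frame (the values are made up for illustration, not taken from train.csv) to contrast the blanket fill with a targeted one that drops unlabeled rows and fills only the text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset: one row has no author, one has no label.
df = pd.DataFrame({
    "author": ["A. Writer", None, "C. Reporter"],
    "title": ["First headline", "Second headline", "Third headline"],
    "label": [1.0, 0.0, np.nan],
})

# Blanket fill: the numeric label column is upcast to object and gains '' as a value.
blanket = df.fillna("")
print(blanket["label"].dtype)

# Targeted alternative: drop rows without a label, fill only the text columns,
# and keep the label as an integer.
clean = df.dropna(subset=["label"]).copy()
clean[["author", "title"]] = clean[["author", "title"]].fillna("")
clean["label"] = clean["label"].astype(int)
print(clean)
```

Whether dropping the unlabeled rows is acceptable depends on the task; with 34 such rows out of 34534 the loss would be small.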

In [33]:
news_dataset.head()

Unnamed: 0,Unnamed: 1,id,title,author,text,label
0,House Dem Aide: We Didnât Even See Comeyâs Letter Until Jason Chaffetz Tweeted It,Darrell Lucus,"""House Dem Aide: We Didnât Even See Comeyâ...",2016 Subscribe Jason Chaffetz on the stump in...,Utah ( image courtesy Michael Jolley,available under a Creative Commons-BY license)
With apologies to Keith Olbermann,there is no doubt who the Worst Person in The World is this weekâFBI Director James Comey. But according to a House Democratic aide,it looks like we also know who the second-wor...,the ranking Democrats on the relevant committ...,,,
As we now know,Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence,Judiciary,and Oversight committees that his agency was ...,Oversight Committee Chairman Jason Chaffetz s...,"""""The FBI has learned of the existence of ema...",
â Jason Chaffetz (@jasoninthehouse) October 28,2016,,,,,
Of course,we now know that this was not the case . Comey was actually saying that it was reviewing the emails in light of âan unrelated caseââwhich we now know to be Anthony Weinerâs sexting with a teenager. But apparently such little things as facts didnât matter to Chaffetz. The Utah Republican had already vowed to initiate a raft of investigations if Hillary winsâat least two yearsâ worth,and possibly an entire termâs worth of them...,,,,


In [65]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [66]:
print(news_dataset['content'])

0        Darrell Lucus House Dem Aide: We Didnât Even...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
34529    Jerome Hudson Rapper T.I.: Trump a âPoster C...
34530    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
34531    Michael J. de la Merced and Rachel Abrams Macy...
34532    Alex Ansary NATO, Russia To Hold Parallel Exer...
34533              David Swanson What Keeps the F-35 Alive
Name: content, Length: 34534, dtype: object


In [67]:
# separating the data and label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [69]:
print(Y)
print(X)

0        1
1        0
2        1
3        1
4        1
        ..
34529    0
34530    0
34531    0
34532    1
34533    1
Name: label, Length: 34534, dtype: object
          id                                              title  \
0          0  House Dem Aide: We Didnât Even See Comeyâs...   
1          1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2          2                  Why the Truth Might Get You Fired   
3          3  15 Civilians Killed In Single US Airstrike Hav...   
4          4  Iranian woman jailed for fictional unpublished...   
...      ...                                                ...   
34529  20795  Rapper T.I.: Trump a âPoster Child For White...   
34530  20796  N.F.L. Playoffs: Schedule, Matchups and Odds -...   
34531  20797  Macyâs Is Said to Receive Takeover Approach ...   
34532  20798  NATO, Russia To Hold Parallel Exercises In Bal...   
34533  20799                          What Keeps the F-35 Alive   

                                

Stemming:

Stemming is the process of reducing a word to its root word

example: acting, acted, acts --> act

In [70]:
port_stem = PorterStemmer()

How the Function Works
Cleaning (re.sub):

re.sub('[^a-zA-Z]', ' ', content): Keeps only the ASCII letters a-z and A-Z and replaces every other character with a space.
After this step the text contains only letters; digits, punctuation and similar characters are removed.
Lowercasing (lower):

stemmed_content.lower(): Converts all letters in the text to lowercase.
This removes upper/lower case differences so that later steps behave consistently.
Splitting into Words (split):

stemmed_content.split(): Splits the text into words and puts each word into a list.
After this step the text is a list of words.
Stopword Removal and Stemming (stemming):

if not word in stopwords.words('english'): Drops common English stopwords (for example "the", "is", "and").
port_stem.stem(word): Finds the stem of each remaining word. port_stem is a PorterStemmer object.
This step reduces words to their simplest form (for example, "running" → "run").
Joining (join):

' '.join(stemmed_content): Joins the stemmed words back together into a single string.
The result is a new text made up of the cleaned, stemmed words.


In [71]:
def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]',' ',content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if not word in stopwords.words('english')]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content
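
The function above calls stopwords.words('english') once per word, which rebuilds the list every time and can be slow on tens of thousands of rows. A possible variant, sketched under the assumption that the stopword list does not change between calls, builds the lookup set once. The name make_preprocessor is hypothetical, and the demo uses a toy stopword list and an identity stemmer so that it runs without NLTK data.

```python
import re

def make_preprocessor(stem, stop_words):
    """Return a cleaning function that reuses one stopword set and one regex."""
    stop_set = frozenset(stop_words)          # O(1) membership tests
    non_letters = re.compile('[^a-zA-Z]')     # same pattern as the notebook

    def preprocess(content):
        words = non_letters.sub(' ', content).lower().split()
        return ' '.join(stem(word) for word in words if word not in stop_set)

    return preprocess

# Toy inputs for illustration only: a two-word stopword list and no real stemming.
demo = make_preprocessor(stem=lambda word: word, stop_words=['the', 'is'])
print(demo('The F-35 is Alive!'))
```

In the notebook this would be called as make_preprocessor(port_stem.stem, stopwords.words('english')) and the result passed to news_dataset['content'].apply.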

In [72]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [81]:
print(news_dataset['content'])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
34529    jerom hudson rapper trump poster child white s...
34530    benjamin hoffman n f l playoff schedul matchup...
34531    michael j de la merc rachel abram maci said re...
34532    alex ansari nato russia hold parallel exercis ...
34533                            david swanson keep f aliv
Name: content, Length: 34534, dtype: object


In [82]:
# separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [83]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [84]:
print(Y)

['1' '0' '1' ... '0' '1' '1']


In [85]:
Y.shape

(34534,)

In [86]:
# converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
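
To make the TF-IDF step concrete, here is a small sketch on a hypothetical three-document corpus (the documents are made up, written in the same stemmed style as the content column). It shows that the result is a sparse matrix with one row per document and one column per vocabulary term, and that a term occurring in fewer documents gets a higher weight than a more common one.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stemmed headlines, for illustration only.
docs = [
    "trump poster child",
    "nato russia hold exercis",
    "trump nato",
]

vec = TfidfVectorizer()
matrix = vec.fit_transform(docs)      # fit and transform in one call
vocab = vec.vocabulary_               # term -> column index

print(matrix.shape)                   # rows = documents, columns = distinct terms

# Within the first document every term occurs once, so the weights differ only
# through the inverse document frequency: "poster" (1 document) vs "trump" (2 documents).
row = matrix[0].toarray().ravel()
print(row[vocab["poster"]], row[vocab["trump"]])

# Rows are L2-normalised by default.
print(np.linalg.norm(row))
```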

In [87]:
print(X)

  (0, 15706)	0.28455523794195114
  (0, 13490)	0.25521446694382066
  (0, 8920)	0.37003336476394333
  (0, 8641)	0.29058386028917865
  (0, 7703)	0.24660109621251788
  (0, 7012)	0.2178760420310972
  (0, 4977)	0.23222751670272004
  (0, 3794)	0.27063475013300203
  (0, 3602)	0.36473523585594847
  (0, 2961)	0.2462965640730307
  (0, 2483)	0.3623157324977447
  (0, 267)	0.2685309738578144
  (1, 16823)	0.3031127497822591
  (1, 6823)	0.19062444529015957
  (1, 5509)	0.7131569798778061
  (1, 3570)	0.2621182973871198
  (1, 2815)	0.19110653098410563
  (1, 2223)	0.38282831679867374
  (1, 1894)	0.15583395530167127
  (1, 1497)	0.29516546893404155
  (2, 15631)	0.41434683446335924
  (2, 9633)	0.4924308462108994
  (2, 5974)	0.34864310213419314
  (2, 5395)	0.387306866507998
  (2, 3105)	0.4616424218263026
  :	:
  (34531, 13139)	0.24298931090938686
  (34531, 12361)	0.2726027609874116
  (34531, 12155)	0.24794938945362588
  (34531, 10319)	0.08054951950593627
  (34531, 9601)	0.17482499493139575
  (34531, 9530)	0.2

Splitting the dataset to training & test data

In [109]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state=2)
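
In the cells above the vectorizer is fitted on the full dataset before the split, so the vocabulary and IDF weights already include the test rows. One common alternative, sketched here on a hypothetical eight-headline corpus, is to split the raw text first and wrap the vectorizer and classifier in a pipeline so that fitting only ever sees training data. The stratify argument keeps the class proportions equal in both parts; the texts and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stemmed headlines with string labels ('0' = real, '1' = fake).
texts = [
    "senat pass budget bill",
    "court rule tax case",
    "citi council approv park plan",
    "team win nation titl",
    "alien secretli run senat",
    "miracl pill cure everi diseas",
    "celebr clone replac court judg",
    "secret plan hide moon base",
]
labels = ['0', '0', '0', '0', '1', '1', '1', '1']

# Split the raw text first, keeping the class balance in both parts.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# The pipeline fits TF-IDF on the training text only, then the classifier.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(X_train_txt, y_train)
preds = pipe.predict(X_test_txt)
print(preds)
```

With a corpus this small the predictions themselves are not meaningful; the point is the order of operations.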

Training the Model: Logistic Regression

In [110]:
model = LogisticRegression()

In [111]:
model.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
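
The message above is a convergence warning: the solver stopped at its iteration limit before meeting its tolerance. As the warning itself suggests, one option is a larger max_iter. The sketch below uses synthetic data from make_classification rather than the news dataset, and the value 1000 is an arbitrary assumption, not a tuned setting.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

demo_model = LogisticRegression(max_iter=1000)

# Turn a ConvergenceWarning into an error so a silent non-convergence cannot slip by.
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    demo_model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used.
print(demo_model.n_iter_)
```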


Evaluation

accuracy score

In [112]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [172]:
print('Accuracy score on the training data : ', training_data_accuracy)

Accuracy score on the training data :  0.9929416874796395


In [114]:
# accuracy score on the test data
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [173]:
print('Accuracy score on the test data : ', test_data_accuracy)

Accuracy score on the test data :  0.9907340379325322
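
Accuracy alone does not show which kind of mistake the model makes. A confusion matrix and a per-class report can complement it. The sketch below uses five hypothetical true and predicted labels, written as strings to match the label format printed earlier; in the notebook the same calls would take Y_test and X_test_prediction.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels, for illustration only ('0' = real, '1' = fake).
y_true = ['0', '0', '1', '1', '1']
y_pred = ['0', '1', '1', '1', '0']

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_pred, labels=['0', '1'])
print(cm)

# Precision, recall and F1 for each class.
report = classification_report(
    y_true, y_pred, labels=['0', '1'], target_names=['real', 'fake']
)
print(report)
```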


Making a Predictive System

In [174]:
# Take a single sample from the test set; indexing the sparse matrix keeps it as a 1-row matrix, which model.predict accepts
X_new = X_test[12]

# Run the model prediction
prediction = model.predict(X_new)

print(prediction)  # inspect the predicted label

# Labels are strings in this run, and 0 means real while 1 means fake
if str(prediction[0]) == '0':
    print('The news is Real')
else:
    print('The news is Fake')

['0']
The news is Real


In [177]:
print(Y_test[12])

0
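
Because the labels in this run are strings, comparing a prediction against the integer 0 never matches. A small helper can make the mapping explicit and tolerate both forms. The name describe_label is hypothetical, and the mapping follows the dataset description (1 = fake, 0 = real).

```python
def describe_label(label):
    """Map a predicted label (int or str) to a readable message."""
    text = str(label).strip()
    if text in ('0', '0.0'):
        return 'The news is Real'
    if text in ('1', '1.0'):
        return 'The news is Fake'
    # Anything else, such as the '' left behind by fillna(''), is reported as-is.
    return f'Unknown label: {label!r}'

print(describe_label('0'))
print(describe_label(1))
```

In the notebook this could be called as describe_label(prediction[0]).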
