In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv('data/fake_news_dataset.csv')
df

Unnamed: 0,title,text,date,source,author,category,label
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake
...,...,...,...,...,...,...,...
19995,House party born.,hit and television I change very our happy doo...,2024-12-04,BBC,Gary Miles,Entertainment,fake
19996,Though nation people maybe price box.,fear most meet rock even sea value design stan...,2024-05-26,Daily News,Maria Mcbride,Entertainment,real
19997,Yet exist with experience unit.,activity loss very provide eye west create wha...,2023-04-17,BBC,Kristen Franklin,Entertainment,real
19998,School wide itself item.,term point general common training watch respo...,2024-06-30,Reuters,David Wise,Health,fake


In [None]:
df.isna().sum()

title          0
text           0
date           0
source      1000
author      1000
category       0
label          0
dtype: int64

In [None]:
df['author'] = df['author'].fillna('unknown')
df['source'] = df['source'].fillna('unknown')

In [None]:
df.isna().sum()

title       0
text        0
date        0
source      0
author      0
category    0
label       0
dtype: int64

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     20000 non-null  object
 1   text      20000 non-null  object
 2   date      20000 non-null  object
 3   source    20000 non-null  object
 4   author    20000 non-null  object
 5   category  20000 non-null  object
 6   label     20000 non-null  object
dtypes: object(7)
memory usage: 1.1+ MB


In [None]:
df['date'] = pd.to_datetime(df['date'], errors='coerce')

In [None]:
df['full_text'] = df['title'].str.strip() + ' ' + df['text'].str.strip()

In [None]:
df

Unnamed: 0,title,text,date,source,author,category,label,full_text
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real,Foreign Democrat final. more tax development b...
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake,To offer down resource great point. probably g...
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake,Himself church myself carry. them identify for...
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake,You unit its should. phone which item yard Rep...
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake,Billion believe employee summer how. wonder my...
...,...,...,...,...,...,...,...,...
19995,House party born.,hit and television I change very our happy doo...,2024-12-04,BBC,Gary Miles,Entertainment,fake,House party born. hit and television I change ...
19996,Though nation people maybe price box.,fear most meet rock even sea value design stan...,2024-05-26,Daily News,Maria Mcbride,Entertainment,real,Though nation people maybe price box. fear mos...
19997,Yet exist with experience unit.,activity loss very provide eye west create wha...,2023-04-17,BBC,Kristen Franklin,Entertainment,real,Yet exist with experience unit. activity loss ...
19998,School wide itself item.,term point general common training watch respo...,2024-06-30,Reuters,David Wise,Health,fake,School wide itself item. term point general co...


In [None]:
import spacy
import re
from tqdm.auto import tqdm

In [None]:
tqdm.pandas()

In [None]:
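nlp = spacy.load("en_core_web_sm")
# NER and the dependency parser are not needed for lemmatization; disabling them speeds up nlp.pipe
nlp.disable_pipes("ner", "parser")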

['ner', 'parser']

In [None]:
CUSTOM_STOP = {
    'mr', 'mrs', 'ms', 'build', 'leave', 'good', 'include', 'meet', 'feel', 'well', 'think', 'late', 'add',
    'hour', 'day', 'month', 'minute', 'set'
}

URL_RE = re.compile(r"http\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")

In [None]:
for w in CUSTOM_STOP:
    nlp.vocab[w].is_stop = True  # extend the library's built-in stop-word list

In [None]:
def normalize_text_en(s: str) -> str:
    """Strip URLs and collapse all whitespace (including newlines) to single spaces."""
    s = "" if s is None else str(s)
    s = URL_RE.sub(" ", s)
    s = s.replace("\n", " ")
    s = SPACE_RE.sub(" ", s).strip()
    return s
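
A quick sanity check of the normalization step, as a sketch only. It re-declares `URL_RE`, `SPACE_RE` and `normalize_text_en` as defined above so that it runs on its own; the sample strings are made up for illustration and do not come from the dataset.

```python
import re

# Same patterns as in the notebook
URL_RE = re.compile(r"http\S+|www\.\S+")
SPACE_RE = re.compile(r"\s+")

def normalize_text_en(s):
    s = "" if s is None else str(s)
    s = URL_RE.sub(" ", s)            # drop URLs
    s = s.replace("\n", " ")          # newlines become spaces
    s = SPACE_RE.sub(" ", s).strip()  # collapse runs of whitespace
    return s

# Hypothetical inputs, chosen to exercise each branch
samples = [
    "Read more at https://example.com/story?id=1 today",
    "line one\nline two\t\ttabbed",
    None,
    "visit www.example.org now",
]
cleaned = [normalize_text_en(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))

# Note: only None maps to ""; a float NaN is stringified instead
nan_result = normalize_text_en(float("nan"))
print(repr(nan_result))
```

Because only `None` is special-cased, a missing value stored as `NaN` would come out as the literal string `nan`. In this notebook `title` and `text` have no missing values, so the case does not arise.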

In [None]:
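def spacy_lemmas_pipe(texts, batch_size=512):
    """Return one list of lowercase lemmas per input text, with noise tokens filtered out."""
    texts = [normalize_text_en(t) for t in texts]
    out = []

    for doc in tqdm(nlp.pipe(texts, batch_size=batch_size), total=len(texts)):
        lemmas = []
        for t in doc:
            # skip whitespace, punctuation and number-like tokens
            if t.is_space or t.is_punct or t.like_num:
                continue
            # skip stop words (built-in list plus the CUSTOM_STOP flags set above)
            if t.is_stop:
                continue

            lemma = t.lemma_.lower().strip()

            # "-pron-" was the pronoun placeholder lemma in spaCy v2
            if lemma in {"-pron-", ""}:
                continue
            # drop very short lemmas
            if len(lemma) < 3:
                continue

            lemmas.append(lemma)
        out.append(lemmas)

    return out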

df["tokens"] = spacy_lemmas_pipe(df["full_text"].astype(str).tolist(), batch_size=512)

  0%|          | 0/20000 [00:00<?, ?it/s]
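
The tokens printed in the next output still contain `mrs` (row 0), `build` (row 19995) and `leave` (row 19996), although all three are in `CUSTOM_STOP`. A likely cause is that `t.is_stop` is evaluated on the token as written (`Mrs`, `building`, `left`), while `CUSTOM_STOP` holds lowercase lemmas. One possible workaround, sketched here under that assumption, is to filter the lemma lists after the fact. `drop_custom_stop` and `sample_tokens` are hypothetical names introduced for illustration.

```python
# Same entries as the notebook's CUSTOM_STOP
CUSTOM_STOP = {
    "mr", "mrs", "ms", "build", "leave", "good", "include", "meet", "feel",
    "well", "think", "late", "add", "hour", "day", "month", "minute", "set",
}

def drop_custom_stop(token_lists, stop=CUSTOM_STOP):
    """Remove every lemma found in `stop` from each token list."""
    return [[tok for tok in toks if tok not in stop] for toks in token_lists]

# Hypothetical excerpts modeled on the printed rows, not real dataset rows
sample_tokens = [
    ["box", "receive", "mrs", "risk", "benefit"],
    ["task", "approach", "build"],
]
filtered = drop_custom_stop(sample_tokens)
print(filtered)
```

On the DataFrame this could be applied as `df["tokens"] = drop_custom_stop(df["tokens"])`. An equivalent option is an extra `if lemma in CUSTOM_STOP: continue` check inside the loop of `spacy_lemmas_pipe`.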

In [None]:
df

Unnamed: 0,title,text,date,source,author,category,label,full_text,tokens
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real,Foreign Democrat final. more tax development b...,"[foreign, democrat, final, tax, development, store, agreement, lawyer, hear, outside, continue, reach, difference, yeah, figure, power, fear, identify, protect, security, great, national, fast, story, nearly, bit, cost, tough, question, power, future, young, conference, ahead, building, teach, box, receive, mrs, risk, benefit, compare, environment, class, imagine, vote, community, reason, idea, answer, purpose, deep, training, game, true, language, garden, partner, result, face, military, discover, discover, datum, glass, bed, maintain, test, way, development, culture, glass, yes, decision, hope, necessary, trade, organization, talk, debate, peace, stay, community, development, wide, write, fight, teach, common, fear, personal, church, establish, store, kind, debate, hotel, cut, sister, audience, ...]"
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake,To offer down resource great point. probably g...,"[offer, resource, great, point, probably, guess, western, likely, investment, consumer, range, wrong, exactly, attack, shoulder, movie, partner, daughter, executive, tonight, factor, push, development, pass, question, field, firm, accept, represent, answer, computer, win, fast, small, character, total, air, difficult, green, fast, writer, adult, individual, learn, interview, available, drug, group, produce, large, wish, find, medium, nature, computer, project, story, special, stand, lead, ball, contain, road, history, customer, garden, figure, kind, throw, tell, discuss, remain, view, morning, mouth, serve, great, certain, free, structure, skin, yard, position, suffer, fast, mind, outside, position, write, theory, letter, debate, seat, fall, authority, bit, deep, man, view, loss, ...]"
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake,Himself church myself carry. them identify for...,"[church, carry, identify, forward, present, success, risk, pull, blood, choose, bear, prove, clear, approach, language, election, future, plant, thing, soon, guy, vote, practice, dream, find, despite, artist, teacher, social, eye, care, small, act, outside, college, travel, continue, night, military, room, instead, follow, long, president, community, people, like, attention, fall, crime, history, despite, fill, recently, need, commercial, investment, address, send, religious, join, opportunity, story, idea, exactly, difference, loss, degree, lead, response, card, national, structure, state, arm, low, threat, property, eat, bill, public, trip, bed, note, hair, teach, defense, citizen, believe, level, wall, short, religious, theory, hair, respond, town, return, discussion, investment, ...]"
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake,You unit its should. phone which item yard Rep...,"[unit, phone, item, yard, republican, safe, police, identify, participant, man, human, tough, offer, high, imagine, point, police, woman, paper, cover, reach, service, likely, president, conference, film, agree, discover, moment, positive, help, task, share, necessary, story, right, finally, compare, traditional, change, reason, purpose, single, crime, available, point, building, wear, speech, summer, senior, couple, somebody, remember, push, datum, hotel, authority, situation, visit, general, society, firm, positive, player, play, page, miss, brother, window, energy, lose, stage, range, common, story, hot, strong, adult, produce, carry, guess, television, travel, form, meeting, industry, shoulder, market, sure, certain, parent, walk, husband, cultural, collection, difficult, team, probably, produce, ...]"
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake,Billion believe employee summer how. wonder my...,"[believe, employee, summer, wonder, fact, difficult, course, forget, exactly, pattern, sell, training, understand, ahead, single, western, ago, worry, direction, agree, remember, tonight, year, building, agreement, involve, effect, total, game, look, need, evidence, particularly, attorney, agency, apply, produce, theory, deep, fund, relationship, suffer, guess, morning, quickly, home, authority, physical, choice, environment, skin, fill, cost, state, force, fire, establish, recent, world, style, bad, game, memory, world, hair, simply, week, try, sport, animal, yeah, visit, note, value, military, laugh, politic, official, attack, shoulder, administration, wish, space, receive, thing, structure, produce, cultural, approach, law, doctor, money, interesting, lawyer, activity, child, item, space, sell, movie, ...]"
...,...,...,...,...,...,...,...,...,...
19995,House party born.,hit and television I change very our happy doo...,2024-12-04,BBC,Gary Miles,Entertainment,fake,House party born. hit and television I change ...,"[house, party, bear, hit, television, change, happy, door, wide, east, performance, run, land, ability, bag, live, difficult, leg, particular, wall, sport, cell, participant, society, relationship, free, vote, century, industry, war, cause, run, mind, plan, reflect, place, sea, art, eat, true, represent, live, time, effect, usually, size, draw, civil, manager, particular, begin, summer, democrat, run, main, place, special, bed, father, walk, direction, machine, manager, fish, far, conference, list, imagine, offer, happen, house, ability, attack, result, approach, hospital, american, lie, cut, western, change, truth, firm, line, direction, people, unit, skin, class, case, morning, office, raise, military, win, common, way, task, approach, build, ...]"
19996,Though nation people maybe price box.,fear most meet rock even sea value design stan...,2024-05-26,Daily News,Maria Mcbride,Entertainment,real,Though nation people maybe price box. fear mos...,"[nation, people, maybe, price, box, fear, rock, sea, value, design, standard, mother, cut, realize, remember, hair, outside, thing, personal, enter, painting, let, mention, share, hold, accept, behavior, leave, vote, later, look, kind, card, military, major, unit, son, necessary, east, sea, story, measure, event, model, artist, degree, receive, speak, field, live, research, boy, president, star, treat, congress, finally, able, ago, student, career, point, campaign, leg, government, history, language, foreign, material, century, mission, family, sell, role, cultural, line, red, stay, ball, employee, government, start, site, eat, different, bag, team, pass, stage, test, point, development, near, address, effort, song, hospital, believe, energy, activity, ...]"
19997,Yet exist with experience unit.,activity loss very provide eye west create whatever walk trial next school whose power part article finish discuss hear similar lose along draw protect drive particular change civil chair start make short line likely necessary single sport music performance on budget those ago cover message blue claim six my happy provide tend field point will head other see career bar old do reach week sit produce special there long effort program value son pull camera point fall successful growth medical base career hair common task risk all money among will buy big open indeed apply according pull life actually law fish believe produce player hand most conference toward watch civil last kid because approach oil matter avoid anything top also western process many base vote ahead whatever friend bad place guy campaign your simple identify none alone system try memory simple deep activity indicate center meet nothing less peace term federal could probably effort few stuff technology PM hospital pretty policy safe off next thing sing federal grow rich you picture model race almost big individual it white why ever build without against movie prove forget present better arrive begin onto institution assume unit fish statement opportunity small candidate word rather station own score night around remember strategy cover argue why bring politics project who value involve mind enjoy but better determine total center station detail try all fact ago need camera by trade impact summer real star yet,2023-04-17,BBC,Kristen Franklin,Entertainment,real,Yet exist with experience unit. 
activity loss very provide eye west create whatever walk trial next school whose power part article finish discuss hear similar lose along draw protect drive particular change civil chair start make short line likely necessary single sport music performance on budget those ago cover message blue claim six my happy provide tend field point will head other see career bar old do reach week sit produce special there long effort program value son pull camera point fall successful growth medical base career hair common task risk all money among will buy big open indeed apply according pull life actually law fish believe produce player hand most conference toward watch civil last kid because approach oil matter avoid anything top also western process many base vote ahead whatever friend bad place guy campaign your simple identify none alone system try memory simple deep activity indicate center meet nothing less peace term federal could probably effort few stuff technology PM hospital pretty policy safe off next thing sing federal grow rich you picture model race almost big individual it white why ever build without against movie prove forget present better arrive begin onto institution assume unit fish statement opportunity small candidate word rather station own score night around remember strategy cover argue why bring politics project who value involve mind enjoy but better determine total center station detail try all fact ago need camera by trade impact summer real star yet,"[exist, experience, unit, activity, loss, provide, eye, west, create, walk, trial, school, power, article, finish, discuss, hear, similar, lose, draw, protect, drive, particular, change, civil, chair, start, short, line, likely, necessary, single, sport, music, performance, budget, ago, cover, message, blue, claim, happy, provide, tend, field, point, head, career, bar, old, reach, week, sit, produce, special, long, effort, program, value, son, pull, camera, point, fall, successful, growth, 
medical, base, career, hair, common, task, risk, money, buy, big, open, apply, accord, pull, life, actually, law, fish, believe, produce, player, hand, conference, watch, civil, kid, approach, oil, matter, avoid, western, process, base, vote, ...]"
19998,School wide itself item.,term point general common training watch responsibility process a author large toward enjoy thank family popular eat move election western his science sort money truth learn peace property difference article group service know art raise job commercial change citizen college these summer way process question move purpose direction value what west military fund defense last hope phone visit yet compare hundred every medical forget sign serve easy able admit southern high difficult happy discussion form current our Mrs suggest fall laugh teach chair her food bring government such on sing able information activity rich worry month dream follow watch close current kitchen month store two billion against season body best than for tend nature grow source growth adult practice image range tonight true little thought travel score federal baby plant front before toward seat specific size minute great local send ago wide suffer lawyer public final be above meet start especially audience song song economy often heart artist sea open outside low bar city when seem detail effect carry yard growth white others energy year floor issue himself base by Mr prevent without west wall product catch PM foot child source win mean myself citizen eat debate much rise feel understand design know painting determine your himself century beyond former fall piece for attack strategy product side administration company another expect data leg moment tree sit must guy federal company door myself movie issue lot chance end why recently call house wall with company take,2024-06-30,Reuters,David Wise,Health,fake,School wide itself item. 
term point general common training watch responsibility process a author large toward enjoy thank family popular eat move election western his science sort money truth learn peace property difference article group service know art raise job commercial change citizen college these summer way process question move purpose direction value what west military fund defense last hope phone visit yet compare hundred every medical forget sign serve easy able admit southern high difficult happy discussion form current our Mrs suggest fall laugh teach chair her food bring government such on sing able information activity rich worry month dream follow watch close current kitchen month store two billion against season body best than for tend nature grow source growth adult practice image range tonight true little thought travel score federal baby plant front before toward seat specific size minute great local send ago wide suffer lawyer public final be above meet start especially audience song song economy often heart artist sea open outside low bar city when seem detail effect carry yard growth white others energy year floor issue himself base by Mr prevent without west wall product catch PM foot child source win mean myself citizen eat debate much rise feel understand design know painting determine your himself century beyond former fall piece for attack strategy product side administration company another expect data leg moment tree sit must guy federal company door myself movie issue lot chance end why recently call house wall with company take,"[school, wide, item, term, point, general, common, training, watch, responsibility, process, author, large, enjoy, thank, family, popular, eat, election, western, science, sort, money, truth, learn, peace, property, difference, article, group, service, know, art, raise, job, commercial, change, citizen, college, summer, way, process, question, purpose, direction, value, west, military, fund, defense, hope, phone, visit, compare, 
medical, forget, sign, serve, easy, able, admit, southern, high, difficult, happy, discussion, form, current, mrs, suggest, fall, laugh, teach, chair, food, bring, government, sing, able, information, activity, rich, worry, dream, follow, watch, close, current, kitchen, store, season, body, well, tend, nature, grow, source, growth, adult, practice, ...]"


In [None]:
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

In [None]:
dictionary = Dictionary(df['tokens'])
dictionary.filter_extremes(no_below=5, no_above=0.5)
dictionary.compactify()

In [None]:
bow_all = [dictionary.doc2bow(toks) for toks in df["tokens"]]

In [None]:
tfidf_model = TfidfModel(bow_all, normalize=True)

In [None]:
from collections import defaultdict

In [None]:
def class_specific_tfidf(
    df,
    tokens_col,
    group_col,
    dictionary,
    tfidf_model,
    topn=10,
    min_docs=10
):
    rows = []

    for group_value, part in df.groupby(group_col):
        if len(part) < min_docs:
            continue

        acc = defaultdict(float)

        # Accumulate TF-IDF weights only over the documents in this group
        for toks in part[tokens_col]:
            bow = dictionary.doc2bow(toks)
            tfidf_doc = tfidf_model[bow]
            for term_id, w in tfidf_doc:
                acc[term_id] += w

        top = sorted(acc.items(), key=lambda x: x[1], reverse=True)[:topn]
        words = [dictionary[term_id] for term_id, _ in top]

        rows.append((group_value, words))

    return pd.DataFrame(rows, columns=[group_col, f"top{topn}_words"])


In [None]:
pd.set_option('display.max_colwidth', None)

In [None]:
top_category_distinct = class_specific_tfidf(
    df=df,
    tokens_col="tokens",
    group_col="category",
    dictionary=dictionary,
    tfidf_model=tfidf_model,
    topn=10,
    min_docs=20
)

top_category_distinct

Unnamed: 0,category,top10_words
0,Business,"[local, entire, listen, well, vote, president, remain, bear, pay, source]"
1,Entertainment,"[well, check, manage, season, door, evening, near, ability, sense, stand]"
2,Health,"[write, poor, available, long, grow, evidence, well, different, expect, reason]"
3,Politics,"[fire, well, small, memory, rate, way, history, summer, interesting, cup]"
4,Science,"[image, writer, garden, able, activity, well, certain, common, appear, degree]"
5,Sports,"[inside, politic, speak, goal, attention, major, environmental, town, owner, sell]"
6,Technology,"[well, usually, present, doctor, cultural, hit, sport, believe, speech, story]"


In [None]:
top_label_distinct = class_specific_tfidf(
    df=df,
    tokens_col="tokens",
    group_col="label",
    dictionary=dictionary,
    tfidf_model=tfidf_model,
    topn=10,
    min_docs=20
)

display(top_label_distinct)


Unnamed: 0,label,top10_words
0,fake,"[well, moment, claim, significant, offer, wide, particular, rich, subject, summer]"
1,real,"[well, economy, reflect, remain, play, member, drive, produce, girl, bring]"


In [None]:
fake_words = set(top_label_distinct[top_label_distinct["label"]=="fake"]["top10_words"].iloc[0])
real_words = set(top_label_distinct[top_label_distinct["label"]=="real"]["top10_words"].iloc[0])

print("Intersection fake & real:", fake_words & real_words)


Intersection fake & real: {'well'}


In [None]:
author_stats = (
    df.groupby("author")
      .agg(
          n=("label", "size"),
          fake_n=("label", lambda x: (x == "fake").sum()),
          real_n=("label", lambda x: (x == "real").sum()),
      )
      .reset_index()
)

author_stats["fake_rate"] = author_stats["fake_n"] / author_stats["n"]
author_stats["real_rate"] = author_stats["real_n"] / author_stats["n"]

MIN_N = 5 # avoid picking up noise from authors with only 1-2 articles
author_stats_f = author_stats[author_stats["n"] >= MIN_N].copy()

In [None]:
author_stats_f.sort_values("fake_rate", ascending=False).head(15) # "fake" authors

Unnamed: 0,author,n,fake_n,real_n,fake_rate,real_rate
6675,James Smith,7,6,1,0.857143,0.142857
11868,Michael Williams,5,4,1,0.8,0.2
4163,David Jones,5,4,1,0.8,0.2
9623,Kimberly Smith,5,4,1,0.8,0.2
11242,Matthew Smith,5,4,1,0.8,0.2
7836,John Brown,7,5,2,0.714286,0.285714
4126,David Garcia,6,4,2,0.666667,0.333333
11874,Michael Young,5,3,2,0.6,0.4
1328,Ashley Davis,5,3,2,0.6,0.4
11592,Michael Brown,5,3,2,0.6,0.4


In [None]:
display(author_stats_f.sort_values("fake_rate", ascending=True).head(15)) # "truthful" authors

Unnamed: 0,author,n,fake_n,real_n,fake_rate,real_rate
1171,Anthony Jones,5,1,4,0.2,0.8
14242,Ryan Smith,5,1,4,0.2,0.8
7257,Jennifer Miller,5,1,4,0.2,0.8
5177,Elizabeth Johnson,5,1,4,0.2,0.8
7514,Jessica Brown,5,1,4,0.2,0.8
7978,John Smith,11,3,8,0.272727,0.727273
11718,Michael Lee,7,2,5,0.285714,0.714286
11828,Michael Smith,12,4,8,0.333333,0.666667
1412,Ashley Smith,6,2,4,0.333333,0.666667
8455,Joshua Smith,5,2,3,0.4,0.6


In [None]:
# are there categories with a higher share of fake news?
ct = pd.crosstab(df["category"], df["label"], normalize="index")
ct["n"] = df["category"].value_counts()
ct = ct[ct["n"] >= 50].sort_values("fake", ascending=False)
ct

label,fake,real,n
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Business,0.516322,0.483678,2849
Health,0.507187,0.492813,2922
Entertainment,0.505365,0.494635,2889
Sports,0.503662,0.496338,2867
Politics,0.499286,0.500714,2802
Technology,0.494101,0.505899,2882
Science,0.493367,0.506633,2789


### Logistic regression

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

In [None]:
# TfidfVectorizer expects strings, so we simply join the tokens back together
df["text_for_model"] = df["tokens"].apply(lambda x: " ".join(x))

In [None]:
tfidf = TfidfVectorizer(
    tokenizer=str.split, # reuse the tokens prepared earlier
    preprocessor=None,
    lowercase=False, # tokens are already lowercased
    min_df=5,
    max_df=0.8,
    ngram_range=(1, 2), # unigrams and bigrams
    sublinear_tf=True
)

In [None]:
X = tfidf.fit_transform(df["text_for_model"])
y = (df["label"] == "fake").astype(int).values



In [None]:
feature_names = tfidf.get_feature_names_out()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

In [None]:
clf = LogisticRegression(
    max_iter=3000,
    n_jobs=-1,
    class_weight="balanced"
)

clf.fit(X_train, y_train)

In [None]:
pred = clf.predict(X_test)
proba = clf.predict_proba(X_test)[:, 1]

print(classification_report(y_test, pred, target_names=["real", "fake"]))
print("ROC-AUC:", roc_auc_score(y_test, proba))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))

              precision    recall  f1-score   support

        real       0.50      0.50      0.50      1989
        fake       0.51      0.51      0.51      2011

    accuracy                           0.51      4000
   macro avg       0.51      0.51      0.51      4000
weighted avg       0.51      0.51      0.51      4000

ROC-AUC: 0.5013519158954558
Confusion matrix:
 [[ 993  996]
 [ 979 1032]]


In [None]:
coef = clf.coef_.ravel()

top_fake_idx = np.argsort(coef)[-20:][::-1]
top_real_idx = np.argsort(coef)[:20]

top_fake_words = [feature_names[i] for i in top_fake_idx]
top_real_words = [feature_names[i] for i in top_real_idx]

print("Top words pushing towards FAKE:")
print(top_fake_words)

print("\nTop words pushing towards REAL:")
print(top_real_words)

Top words pushing towards FAKE:
['significant', 'particular', 'model', 'wrong', 'standard', 'far', 'fire', 'role', 'animal', 'site', 'claim', 'agency', 'rich', 'moment', 'modern', 'boy', 'trade', 'mother', 'seek', 'policy']

Top words pushing towards REAL:
['economy', 'bring', 'big', 'reflect', 'girl', 'example', 'point', 'radio', 'play', 'dark', 'great', 'politic', 'sister', 'bar', 'weight', 'son', 'general', 'nice', 'congress', 'head']


Conclusion: the model's accuracy is at chance level. The EDA was fairly thorough (several different approaches), yet none of it changed model quality in any meaningful way: it stayed at chance level. The analysis also shows that fake and real news items are very hard to tell apart, because they have no obvious distinguishing features or cues. This probably means that, with what the source dataset provides, automating the real/fake classification of the news is not feasible (they lie well!).

# Conclusions (written by the girls in class)

## 1. Text preprocessing
The text data were prepared in the following steps:
- merging the title and the body text into a single text field;
- tokenizing and lemmatizing the English text with **spaCy**;
- removing punctuation, numbers, and stop words;
- filtering out tokens that are too short;
- building a list of lemmatized tokens for each document.

This preprocessing normalized the texts and reduced the amount of noise.
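
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual preprocessing cell: the function names `doc_to_tokens` and `preprocess`, the `min_len=3` threshold, and the `en_core_web_sm` model in the usage comment are all assumptions.

```python
def doc_to_tokens(doc, min_len=3):
    """Return lowercased lemmas from a spaCy-like Doc, dropping noise tokens."""
    tokens = []
    for tok in doc:
        # Skip punctuation, whitespace, number-like tokens and stop words.
        if tok.is_punct or tok.is_space or tok.like_num or tok.is_stop:
            continue
        lemma = tok.lemma_.lower()
        # Drop very short lemmas (the threshold is an assumption).
        if len(lemma) < min_len:
            continue
        tokens.append(lemma)
    return tokens


def preprocess(title, text, nlp, min_len=3):
    """Merge title and body, run the pipeline, and filter the tokens."""
    return doc_to_tokens(nlp(f"{title} {text}"), min_len=min_len)


# Usage (assumes spaCy and an English model are installed):
# import spacy
# nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
# df["tokens"] = [preprocess(t, x, nlp) for t, x in zip(df["title"], df["text"])]
```

Passing `nlp` as an argument keeps the filtering logic independent of which spaCy model is loaded.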

---

## 2. Exploratory data analysis (EDA)

### 2.1 Analysis by category
To study the lexical features of each category, a **class-specific TF-IDF** approach was used, in which:
- TF was computed within each category,
- IDF was computed over the whole corpus.

In the code this amounts to summing the normalized per-document TF-IDF weights over the documents of each category.

The results showed that different categories share many general-vocabulary words typical of news style as a whole (for example, "well" appears in the top 10 of six of the seven categories). This points to weak lexical differentiation between categories at the level of individual words.

### 2.2 fake / real analysis
The same class-specific TF-IDF analysis for the *fake* and *real* classes gave a similar picture: the two top-10 lists share the word "well", and the remaining words are general vocabulary with no evident link to either class. This means that:
- the highest-weighted words in each class are general-purpose vocabulary rather than class-specific markers,
- TF-IDF highlights words that are important within a class, but it is not designed to find features that discriminate between classes.
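
One way to look for discriminative words directly is to compare how often a word occurs in the documents of each class. The sketch below is not part of the original analysis. It swaps in a different technique, a smoothed log-odds ratio of document frequencies, and the function name `log_odds_by_class` and the smoothing constant `alpha` are illustrative assumptions.

```python
import math
from collections import Counter


def log_odds_by_class(token_lists, labels, positive="fake", alpha=1.0):
    """Smoothed log-odds of document frequency: positive class vs. the rest."""
    pos_df, neg_df = Counter(), Counter()
    n_pos = n_neg = 0
    for tokens, label in zip(token_lists, labels):
        if label == positive:
            n_pos += 1
            pos_df.update(set(tokens))  # count each term once per document
        else:
            n_neg += 1
            neg_df.update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        # Additive smoothing keeps both proportions strictly between 0 and 1.
        p = (pos_df[term] + alpha) / (n_pos + 2 * alpha)
        q = (neg_df[term] + alpha) / (n_neg + 2 * alpha)
        scores[term] = math.log(p / (1 - p)) - math.log(q / (1 - q))
    return scores


# Usage with the notebook's columns:
# scores = log_odds_by_class(df["tokens"], df["label"])
# sorted(scores, key=scores.get, reverse=True)[:10]  # words leaning towards "fake"
```

A word that is equally common in both classes scores close to zero, so a list dominated by near-zero scores would support the conclusion that the vocabulary does not separate the classes.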

---

## 3. Analysis of authors and metadata
Authors were analyzed by counting their publications and their share of fake news, keeping only authors with at least 5 articles. A `category × label` cross-tabulation was also built.

The results revealed no stable, unambiguous patterns that would make it possible to reliably identify fake news from the author or the category alone: the share of fake items per category ranges only from about 49% to 52%. This is typical of teaching or synthetic datasets.
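
To judge whether the most "fake" authors in the table are more than noise, it helps to ask how likely such counts are under pure chance. The sketch below assumes that each article is fake with probability 0.5 independently of its author, which is a simplifying assumption and not something the notebook tests. The helper `binom_tail` is a hypothetical name.

```python
from math import comb


def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Counts taken from the author table in this notebook.
print(binom_tail(7, 6))  # an author with 6 fake articles out of 7
print(binom_tail(5, 4))  # an author with 4 fake articles out of 5
```

Under that assumption the two probabilities are about 6% and 19%. Because the table shows the most extreme authors among all those with at least five articles, shares like these are expected to appear by chance alone.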

---

## 4. Classification model

The following model was used for classification:
- **TF-IDF (scikit-learn)**
- **Logistic Regression**

The model was trained on text features only, without any metadata.

### Results:
- ROC-AUC ≈ **0.50**
- Classification quality is at the level of random guessing.

This indicates that in the current dataset there is no strong, stable relationship between the text content and the fake / real label, or that this relationship is not captured by the text representation used.
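
The "random guessing" reading can be checked with a short calculation on the confusion matrix reported above. This is a rough sketch: it uses a normal approximation and treats each test prediction as an independent coin flip under the null hypothesis, both of which are assumptions.

```python
from math import sqrt

# Confusion matrix reported by the notebook (rows: true real, true fake).
tn, fp = 993, 996
fn, tp = 979, 1032

n = tn + fp + fn + tp
accuracy = (tn + tp) / n

# Standard error of accuracy if every prediction were a fair coin flip.
se_chance = sqrt(0.5 * 0.5 / n)
z = (accuracy - 0.5) / se_chance

print(f"accuracy = {accuracy:.4f}, z = {z:.2f}")
```

The accuracy is less than one standard error above 0.5, so the test-set result is statistically indistinguishable from coin flipping.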

---

## 5. Interpreting the results
The low model quality is not an implementation error. On the contrary, the results support important methodological conclusions:
- the model does not exploit data leakage and does not "cheat" by using metadata;
- after preprocessing, the text features do not carry enough signal for confident classification.

---

## 6. Limitations and possible improvements
- Check the quality and origin of the labels (teaching/synthetic nature of the data).
- Relax the text preprocessing filters and compare against a baseline model trained on raw text.
- Apply more complex models once real, well-labeled data are available.
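
The raw-text baseline mentioned above could look roughly like this. It is a sketch under assumptions, not a result from this notebook: the function name `raw_text_baseline_auc`, the default vectorizer settings, and 5-fold cross-validation are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def raw_text_baseline_auc(texts, labels, cv=5):
    """Mean cross-validated ROC-AUC of TF-IDF + logistic regression on raw text."""
    pipe = make_pipeline(
        TfidfVectorizer(lowercase=True),  # default tokenization, no lemmatization
        LogisticRegression(max_iter=1000),
    )
    scores = cross_val_score(pipe, list(texts), list(labels), cv=cv, scoring="roc_auc")
    return scores.mean()


# Usage with the notebook's columns (labels encoded as 1 = fake, 0 = real):
# y = (df["label"] == "fake").astype(int)
# raw_text_baseline_auc(df["title"] + " " + df["text"], y)
```

Wrapping the vectorizer and the classifier in one pipeline means the TF-IDF vocabulary is refit on each training fold, so the held-out fold never influences the features.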
